togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.43k stars 335 forks

Request: Enable artifact prep on local data #83

Closed: hicotton02 closed this issue 7 months ago

hicotton02 commented 7 months ago

@mauriceweber Opening an issue, as requested, to enable artifact prep on local ccnet data instead of an S3 bucket.

mauriceweber commented 7 months ago

@hicotton02 You should now be able to run artifact prep on local data (in the local-data branch) using:

python3 app/src/prep_artifacts.py \
  --artifacts-dir /path/to/artifacts \
  --cc_input /path/to/cc/listings.txt \
  --cc_input_base_uri file:///path/to/cc/data/root \
  --lang LANG \
  --dsir_num_samples DSIR_SAMPLES \
  --classifiers_num_samples CLASSIFIERS_SAMPLES \
  --max_samples_per_book 1000 \
  --max_paragraphs_per_book_sample 250
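
For anyone adapting the command above, here is a minimal wrapper sketch, not part of the repository. It only reuses the flags shown in the comment. The helper names (to_file_uri, run_prep), the environment variable names, and the default values for language and sample counts are all assumptions made for illustration. It builds the file:// base URI from an absolute local path and prints the command by default instead of running it.

```shell
#!/bin/sh
set -eu

# Hypothetical defaults; replace with your own paths and values.
CC_ROOT="${CC_ROOT:-/path/to/cc/data/root}"
LISTINGS="${LISTINGS:-/path/to/cc/listings.txt}"
ARTIFACTS_DIR="${ARTIFACTS_DIR:-/path/to/artifacts}"
# Named PREP_LANG so the shell's own LANG locale variable is left alone.
PREP_LANG="${PREP_LANG:-en}"
DSIR_SAMPLES="${DSIR_SAMPLES:-1000}"                # illustrative value only
CLASSIFIERS_SAMPLES="${CLASSIFIERS_SAMPLES:-1000}"  # illustrative value only
DRY_RUN="${DRY_RUN:-1}"                             # 1 = print the command, do not run it

# Turn an absolute local path into a file:// URI; reject relative paths.
to_file_uri() {
  case "$1" in
    /*) printf 'file://%s\n' "$1" ;;
    *)  echo "error: expected an absolute path, got: $1" >&2; return 1 ;;
  esac
}

# Assemble the prep_artifacts.py call with the flags from the comment above.
run_prep() {
  base_uri="$(to_file_uri "$CC_ROOT")"
  set -- python3 app/src/prep_artifacts.py \
    --artifacts-dir "$ARTIFACTS_DIR" \
    --cc_input "$LISTINGS" \
    --cc_input_base_uri "$base_uri" \
    --lang "$PREP_LANG" \
    --dsir_num_samples "$DSIR_SAMPLES" \
    --classifiers_num_samples "$CLASSIFIERS_SAMPLES" \
    --max_samples_per_book 1000 \
    --max_paragraphs_per_book_sample 250
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s ' "$@"; printf '\n'
  else
    "$@"
  fi
}

run_prep
```

Run it from the repository root with DRY_RUN=0 once the printed command looks right. The absolute-path check is there because an absolute path appended to file:// yields the three-slash form used in the comment (file:///path/...).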