togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.43k stars 335 forks

Running full pipeline on a small part of CC #103

Open zhentingqi opened 4 months ago

zhentingqi commented 4 months ago

Hi! Can anyone please tell me how to run the full mining pipeline using cc_net on just a very small portion of CC? For example, I only want to get around 100M of cleaned data from the newest crawl, 2023-50. Thanks!