togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Specifying arxiv dates #71

Open matthieumeeus opened 1 year ago

matthieumeeus commented 1 year ago

Hi there,

Thanks for making this code available. I am trying to use the arxiv downloader, but I only want to download papers from a certain date range. Any tips on how to approach this?

Many thanks

mauriceweber commented 1 year ago

Hi @matthieumeeus

The arxiv data in their S3 bucket follows the format arXiv_src_<month>_<chunk>.tar (e.g., arXiv_src_1206_004.tar corresponds to chunk 4 of the month 2012-06). If month-level granularity is fine enough, you can run, e.g.,

python run_download.py --aws_config aws_config.ini --workers 1 --target_dir $DATA_DIR --setup

which will produce a file with all the listings in it. You can either filter this generated file or directly change the code here to only include the yymm tags that you are interested in.
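
For the first option, a minimal sketch of filtering by month range could look like the following. It relies only on the arXiv_src_<month>_<chunk>.tar naming described above. The helper names (`chunk_month`, `filter_listings`), the `src/` prefix in the sample, and the assumption that the generated file holds one chunk name per line are illustrative and not taken from the repository, so check them against the file that `--setup` actually writes.

```python
import re
from datetime import date

# Matches names of the form described above, e.g. arXiv_src_1206_004.tar.
# An optional path prefix in front of the name is tolerated.
_CHUNK_RE = re.compile(r"arXiv_src_(\d{2})(\d{2})_(\d+)\.tar$")


def chunk_month(name):
    """Return the first day of the month encoded in a chunk name, or None."""
    match = _CHUNK_RE.search(name.strip())
    if match is None:
        return None
    yy, mm = int(match.group(1)), int(match.group(2))
    if not 1 <= mm <= 12:
        return None
    # Assumption: two-digit years 91-99 mean 19xx, everything else 20xx.
    year = 1900 + yy if yy >= 91 else 2000 + yy
    return date(year, mm, 1)


def filter_listings(names, start, end):
    """Keep chunk names whose month lies in [start, end], both inclusive."""
    kept = []
    for name in names:
        month = chunk_month(name)
        if month is not None and start <= month <= end:
            kept.append(name.strip())
    return kept


if __name__ == "__main__":
    # Hypothetical listing entries, used only to illustrate the filter.
    sample = [
        "src/arXiv_src_9912_001.tar",
        "src/arXiv_src_1205_002.tar",
        "src/arXiv_src_1206_004.tar",
        "src/arXiv_src_1301_001.tar",
    ]
    # Keep chunks from June 2012 through December 2012.
    print(filter_listings(sample, date(2012, 6, 1), date(2012, 12, 1)))
```

Comparing parsed dates rather than the raw yymm strings keeps the range check correct across the 1999/2000 boundary, where a plain string comparison would sort 9912 after 1206.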

I hope this helps!