togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.53k stars 346 forks

Script fixes in data_prep/github #25

Closed: geoffreydstewart closed this issue 1 year ago

geoffreydstewart commented 1 year ago

First of all, thank you for your great work creating this project. I didn't have access to a Slurm workload manager, but I was able to use these scripts to preprocess a sample of the GitHub dataset from BigQuery (which was exactly what I wanted to do!). Here are a couple of points that would improve the scripts for the next person:

Thanks again for your work on this project.

mauriceweber commented 1 year ago

Thanks a lot for pointing this out!

I fixed the TARGET_DIR variables in 43960c7.