togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Add missing script or update README.md #27

Closed by geoffreydstewart 1 year ago

geoffreydstewart commented 1 year ago

Thanks for your great work on this project! As mentioned in #25, the script scripts/github-prepare-download.sh, which is referenced in the README.md, is not present in the repository. Should the file be added, or is the README.md incorrect? Thanks!

mauriceweber commented 1 year ago

Hi @geoffreydstewart, the script is here: https://github.com/togethercomputer/RedPajama-Data/blob/main/data_prep/github/scripts/github-prepare-download.sh

The script lists the files in your GCP bucket and splits that list into chunks so that you can download the chunks in parallel.
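For anyone who wants to see the idea without opening the shell script, here is a minimal Python sketch of the chunking step. It is not the code from github-prepare-download.sh: the function name `split_into_chunks`, the round-robin strategy, and the `gs://example-bucket/...` paths are all assumptions made for illustration, and the bucket listing is replaced by a hard-coded list so the snippet runs without GCP access.

```python
from typing import List


def split_into_chunks(paths: List[str], num_chunks: int) -> List[List[str]]:
    """Distribute paths round-robin across at most num_chunks lists.

    Hypothetical helper for illustration; the real script may split differently.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    chunks: List[List[str]] = [[] for _ in range(num_chunks)]
    for index, path in enumerate(paths):
        # File i goes to chunk i mod num_chunks, so chunk sizes differ by at most one.
        chunks[index % num_chunks].append(path)
    # Drop empty chunks, which occur when there are fewer files than chunks.
    return [chunk for chunk in chunks if chunk]


# Stand-in for a bucket listing (assumed paths, not real data).
listing = [f"gs://example-bucket/github/file-{i:03d}.jsonl" for i in range(7)]
chunks = split_into_chunks(listing, 3)

for chunk_id, chunk in enumerate(chunks):
    print(f"chunk {chunk_id}: {len(chunk)} files")
```

Each resulting chunk could then be written to its own file and handed to a separate download process, which is what makes the parallel download possible.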

geoffreydstewart commented 1 year ago

Yes, it most certainly is! I'm not sure why that wasn't showing up in my IDE, but it's there on my file system. Sorry for the noise.

mauriceweber commented 1 year ago

Cool! :)