geoffreydstewart closed this issue 1 year ago
Hi @geoffreydstewart, the script is here: https://github.com/togethercomputer/RedPajama-Data/blob/main/data_prep/github/scripts/github-prepare-download.sh
The script gets a list of the files in your GCP bucket and splits them into chunks so that you can download the chunks in parallel.
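For illustration only, here is a rough Python sketch of the chunking idea. It is not the actual shell script, and it assumes the file listing is already available as a plain list of names. The function name `split_into_chunks` and the example file names are hypothetical.

```python
from typing import List


def split_into_chunks(file_names: List[str], num_chunks: int) -> List[List[str]]:
    """Distribute file names round-robin across up to num_chunks lists."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    chunks: List[List[str]] = [[] for _ in range(num_chunks)]
    for index, name in enumerate(file_names):
        chunks[index % num_chunks].append(name)
    # Drop empty chunks when there are fewer files than chunks.
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    # Hypothetical listing; in practice this would come from the bucket.
    listing = [f"part-{i:03d}.jsonl.gz" for i in range(7)]
    for chunk_id, chunk in enumerate(split_into_chunks(listing, 3)):
        print(chunk_id, chunk)
```

Each chunk could then be written to its own file and handed to a separate download process, which is roughly what makes the parallel download possible.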
Yes, it most certainly is! I'm not sure why that wasn't showing up in my IDE, but it's there on my file system. Sorry for the noise.
cool!:)
Thanks for your great work on this project! As mentioned in #25, the script scripts/github-prepare-download.sh, which is referenced in this README.md, is not present in the repository. Should the file be added, or is the README.md incorrect? Thanks!