togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Low Data Downloading Speed #89

Closed lipingtang17 closed 5 months ago

lipingtang17 commented 7 months ago

Dear RedPajama Team,

I wanted to express my gratitude for your efforts in releasing the dataset. I am currently downloading the dataset to my S3 bucket using "wget", as outlined on your Hugging Face datasets page. The download speed is fairly low, averaging around 20-30 MB/s.

I am reaching out to inquire if there are any recommendations or methods you could suggest to accelerate the download speed.

Thank you in advance for your assistance.

mauriceweber commented 7 months ago

Hi @lipingtang17, I recommend using aria2c to parallelize the downloads. To use aria2c, you can do the following (this also applies to the other components of RPv2):

# download urls
wget "https://data.together.xyz/redpajama-data-v2/v1.0.0/urls/document-urls.txt" -O "document-urls.txt"

Then you need to convert the file into aria2c's input-file format, where each URL line is followed by an indented out= line giving the relative output path:

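# write each URL, followed by an indented "out=" option line for aria2c
while IFS= read -r line; do
    echo "$line"
    # output path = everything after ".../v1.0.0/"; the trailing sed appends a tab after ".parquet" if present
    echo " out=$(echo "$line" | sed 's|.*/v1.0.0/||')" | sed 's|.parquet|&\t|'
done < document-urls.txt > aria2c-document-urls.txt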
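
For reuse with the other URL lists, the same conversion could be wrapped in a small function. This is only a sketch, not part of the original reply: the name to_aria2c_input and its marker argument are hypothetical, and it assumes one URL per line with the relative path starting right after "/v1.0.0/". It uses shell parameter expansion instead of sed and does not append the tab.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the thread): read URLs on stdin and write
# aria2c input-file format on stdout. Each URL line is followed by an
# indented "out=" option line, as aria2c requires for per-download options.
to_aria2c_input() {
    # Assumption: the relative output path is everything after this marker.
    local marker="${1:-/v1.0.0/}"
    local url
    while IFS= read -r url; do
        # Skip blank lines so they do not produce empty entries.
        if [ -z "$url" ]; then
            continue
        fi
        printf '%s\n' "$url"
        # "##*marker" strips the longest prefix ending in the marker,
        # matching the greedy ".*" in the sed expression; a URL without
        # the marker is kept unchanged.
        printf '  out=%s\n' "${url##*"$marker"}"
    done
}
```

It would be used as: to_aria2c_input < document-urls.txt > aria2c-document-urls.txt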

And run the parallel download:

aria2c --input-file aria2c-document-urls.txt
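
By default aria2c runs up to 5 downloads at the same time, so the degree of parallelism may be worth raising explicitly. The sketch below is an assumption-laden illustration rather than a recommendation from the thread: the target directory redpajama-v2 and the value 16 are made up, and the function name build_aria2c_args and the RUN_DOWNLOAD switch are hypothetical. The aria2c options themselves (--dir, --max-concurrent-downloads, --continue) are standard.

```shell
#!/usr/bin/env bash

# Hypothetical helper: collect the aria2c arguments in an array so they can
# be inspected before starting a long download.
build_aria2c_args() {
    local input_file="$1"   # file in aria2c input format
    local dest_dir="$2"     # assumed local target directory
    local parallel="$3"     # number of files to download concurrently
    ARIA2C_ARGS=(
        --input-file="$input_file"
        --dir="$dest_dir"
        --max-concurrent-downloads="$parallel"
        --continue=true     # resume partially downloaded files on restart
    )
}

# Example values; both the directory name and the concurrency are assumptions.
build_aria2c_args "aria2c-document-urls.txt" "redpajama-v2" 16

# Only start the download when explicitly requested.
if [ "${RUN_DOWNLOAD:-0}" = "1" ]; then
    aria2c "${ARIA2C_ARGS[@]}"
fi
```

Running the script with RUN_DOWNLOAD=1 set in the environment starts the download. The best concurrency depends on your bandwidth and disk, so treat the number as a starting point.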