Low Data Downloading Speed

togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.

Apache License 2.0

4.43k stars, 335 forks

Hi @lipingtang17, I recommend using aria2c to parallelize the downloads. To use aria2c, do the following (this also applies to the other components of RPv2):

# download urls
wget "https://data.together.xyz/redpajama-data-v2/v1.0.0/urls/document-urls.txt" -O "document-urls.txt"

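Then convert the URL list into the input-file format aria2c expects, where each URL is followed by an indented option line (here out=, the relative path to save the file under):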

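# For each URL, emit the URL followed by an indented "out=" option line
# holding the path after /v1.0.0/, so each file is saved under that relative path.
while IFS= read -r line; do
    echo "$line"
    echo " out=$(echo "$line" | sed 's|.*/v1.0.0/||')" | sed 's|.parquet|&\t|'
done < document-urls.txt > aria2c-document-urls.txt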
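
The same conversion can be wrapped in a small function so it is easy to reuse for the other URL lists. This is only a sketch: the function name to_aria2c_input and its third argument are made up for illustration, and unlike the loop above it skips blank lines and does not apply the trailing sed on ".parquet".

```shell
# Hypothetical helper (not part of the RedPajama tooling): convert a plain
# URL list into aria2c input-file format.
#   $1: input URL list, $2: output file,
#   $3: path marker to strip up to (assumed default: /v1.0.0/)
to_aria2c_input() {
    marker="${3:-/v1.0.0/}"
    while IFS= read -r url; do
        # skip empty lines so no stray " out=" entries are written
        [ -n "$url" ] || continue
        printf '%s\n' "$url"
        # strip everything up to and including the last occurrence of the marker
        printf ' out=%s\n' "${url##*"$marker"}"
    done < "$1" > "$2"
}

# Example call, using the file names from above:
#   to_aria2c_input document-urls.txt aria2c-document-urls.txt
```

A quick sanity check is that the output file has exactly twice as many lines as there are non-empty lines in the URL list.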

And run the parallel download:

aria2c --input-file aria2c-document-urls.txt
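
If the default settings are still too slow, aria2c can be told to run more downloads at once. The sketch below is an assumption about what might help rather than a tested recommendation: the wrapper name run_aria2c, the DRY_RUN variable, and the values 16 and 4 are illustrative, while -j (max concurrent downloads), -x (connections per server), and -c (resume partial files) are standard aria2c options.

```shell
# Hypothetical wrapper around aria2c; the numbers are illustrative, not tuned.
#   $1: aria2c input file, $2: number of parallel downloads (assumed default: 16)
run_aria2c() {
    input="$1"
    jobs="${2:-16}"
    # -j: max concurrent downloads, -x: connections per server,
    # -c: continue partially downloaded files after an interrupted run
    set -- aria2c --input-file "$input" -j "$jobs" -x 4 -c
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$@"   # only print the command
    else
        "$@"        # actually run the download
    fi
}

# Example call:
#   run_aria2c aria2c-document-urls.txt 16
```

Setting DRY_RUN=1 prints the command without starting the download, which is handy for checking the flags first.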

Low Data Downloading Speed #89