Closed: lipingtang17 closed this 5 months ago
Hi @lipingtang17, I recommend using aria2c to parallelize the downloads. To use aria2c, you can do the following (this also applies to the other components of RPv2):
# download urls
wget "https://data.together.xyz/redpajama-data-v2/v1.0.0/urls/document-urls.txt" -O "document-urls.txt"
Then you need to rewrite the file into aria2c's input-file format, where each URL is followed by an indented line giving its output path:
while IFS= read -r line; do
echo "$line"
echo " out=$(echo "$line" | sed 's|.*/v1.0.0/||')" | sed 's|.parquet|&\t|'
done < document-urls.txt > aria2c-document-urls.txt
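To sanity-check the conversion before starting a large download, it can help to run the same idea on a single URL and look at the result. The following is only a sketch: `to_aria2c_entry` is a hypothetical helper name, the sample URL path is made up for illustration, and the tab-insertion step from the loop above is left out for brevity.

```shell
# Hypothetical helper: print one aria2c input-file entry for a given URL.
to_aria2c_entry() {
  local url="$1"
  # First line of the entry: the URL itself.
  printf '%s\n' "$url"
  # Second line: an indented option line. The out= value is everything
  # after ".../v1.0.0/", so the remote directory layout is kept locally.
  printf ' out=%s\n' "${url#*/v1.0.0/}"
}

# Made-up URL for illustration only; real entries come from document-urls.txt.
to_aria2c_entry "https://data.together.xyz/redpajama-data-v2/v1.0.0/documents/example/shard.json.gz"
```

For the sample URL this prints the URL followed by a line reading ` out=documents/example/shard.json.gz`. The leading space matters, since aria2c treats indented lines as options for the URL above them.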
And run the parallel download:
aria2c --input-file aria2c-document-urls.txt
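If the default settings do not saturate your link, aria2c's concurrency options can be raised. The values below are assumptions rather than tested recommendations for this dataset, so treat this as a starting point and adjust to your bandwidth and to what the server tolerates.

```shell
# Assumed tuning values, not measured recommendations:
#   --max-concurrent-downloads : number of files downloaded at the same time
#   --max-connection-per-server: connections opened per server for each file
#   --continue=true            : resume partially downloaded files after an interruption
ARIA2C_OPTS="--max-concurrent-downloads=16 --max-connection-per-server=4 --continue=true"

# Only start the download if aria2c is installed and the input file exists.
if command -v aria2c >/dev/null 2>&1 && [ -f aria2c-document-urls.txt ]; then
  # ARIA2C_OPTS is intentionally unquoted so it splits into separate options.
  aria2c --input-file aria2c-document-urls.txt $ARIA2C_OPTS
fi
```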
Dear RedPajama Team,
I wanted to express my gratitude for your efforts in releasing the dataset. I am currently downloading the dataset to my S3 bucket using "wget", as outlined on your Hugging Face datasets page. I've noticed that the download speed is fairly low, averaging around 20-30 MB/s.
I am reaching out to inquire if there are any recommendations or methods you could suggest to accelerate the download speed.
Thank you in advance for your assistance.