togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

slow transfer speeds from URL sources #113

Open axelmagn opened 3 months ago

axelmagn commented 3 months ago

I am working on ingesting the RPV2 dataset into GCS buckets using GCP storage transfer jobs. Speeds seem to be incredibly slow (on the order of 100 KB/s to 1 MB/s), and at this rate it will take weeks to transfer the files. There's still a possibility that the bottleneck is on my end, but it's increasingly looking like the host is either throttling connections or overloaded on I/O.

Can you shed any light on how this dataset is hosted, or what the best transfer methods would be at scale? I've already prototyped out a small pipeline on sampled data, and would like to scale it up in a reasonable timeframe.
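For reference, the alternative I'm weighing if transfer jobs keep stalling is a simple pull-through copy along these lines. This is just a rough sketch of the approach; the bucket name and the `urls.txt` listing are placeholders, not the actual dataset layout:

```python
# Sketch: stream files over HTTP straight into a GCS bucket with
# bounded concurrency. BUCKET_NAME and urls.txt are placeholders.
from concurrent.futures import ThreadPoolExecutor

import requests
from google.cloud import storage

BUCKET_NAME = "my-rpv2-mirror"  # placeholder, not a real bucket

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

def transfer_one(url: str) -> None:
    """Stream one file over HTTP and write it into the bucket."""
    resp = requests.get(url, stream=True, timeout=300)
    resp.raise_for_status()
    # Key the object by host/path so the mirror keeps the source layout.
    blob = bucket.blob(url.split("://", 1)[1])
    blob.upload_from_file(resp.raw)

with open("urls.txt") as f:  # plain list of file URLs, one per line
    urls = [line.strip() for line in f if line.strip()]

# Bounded concurrency; the right worker count depends on the host's limits.
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(transfer_one, urls))
```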

mauriceweber commented 2 months ago

1MB/s sounds extremely slow -- how many connections/requests per second are you sending to our endpoint? We do have throttling mechanisms if too many requests are made. Do you see any 429 errors on your end?
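If you're scripting requests directly, something along these lines will surface 429s and back off instead of silently slowing down. A sketch only, and the retry parameters are arbitrary:

```python
# Sketch: surface HTTP 429s and retry with exponential backoff.
import time

import requests

def get_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    delay = 1.0
    for attempt in range(max_retries):
        resp = requests.get(url, timeout=60)
        if resp.status_code == 429:
            # Rate limited: wait, then retry with a doubled delay.
            print(f"429 on attempt {attempt + 1}; sleeping {delay:.0f}s")
            time.sleep(delay)
            delay *= 2
            continue
        resp.raise_for_status()
        return resp
    raise RuntimeError(f"still rate limited after {max_retries} retries: {url}")
```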

axelmagn commented 2 months ago

Unfortunately, because I am using GCP transfer jobs, I don't know the exact number of concurrent connections. However, I am running 180 concurrent jobs, and they may be sharing IP addresses between them. No 429 errors have been reported.

The throughput has been quite variable, and has recovered quite a bit since the time of posting: [throughput chart attached]

Is this dataset hosted through a single server, or is it distributed across nodes in any way?

axelmagn commented 2 months ago

@mauriceweber can you comment at all on the hosting architecture, or the most efficient way to initiate file transfers? Are the files hosted on a cloud storage service like GCS, S3, or CloudFront, or on a single larger machine? My previous transfer jobs were not successful, and I'll need to start a new transfer job this week. Knowing how the files are hosted will help me form a reasonable estimate of how long the jobs should take, and inform which transfer method I choose.

mauriceweber commented 2 months ago

Hi @axelmagn, apologies for the late answer! The files are hosted on cloud storage and are publicly accessible only via HTTP, with rate limiting on requests -- it is your responsibility to cap the number of requests you make so that you don't get rate limited. We are looking into other solutions for large-scale downloads, to make accessing the full dataset more convenient.
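One way to keep the request rate capped on the client side is a simple pacing throttle shared across download threads. This is a sketch, and `MAX_RPS` is a placeholder value, since the actual limit isn't published:

```python
# Sketch: client-side throttle so all download threads together stay
# under a request-rate budget. MAX_RPS is a placeholder value.
import threading
import time

import requests

MAX_RPS = 5.0  # placeholder budget, requests per second across all threads
_lock = threading.Lock()
_next_slot = 0.0

def throttled_get(url: str) -> requests.Response:
    """Block until the next request slot is free, then issue the GET."""
    global _next_slot
    with _lock:
        now = time.monotonic()
        wait = max(0.0, _next_slot - now)
        # Reserve the next slot, spacing requests 1/MAX_RPS apart.
        _next_slot = max(now, _next_slot) + 1.0 / MAX_RPS
    if wait > 0:
        time.sleep(wait)
    return requests.get(url, timeout=60)
```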

axelmagn commented 2 months ago

No worries and thanks for the reply.

Edit: how many is too many?