Open axelmagn opened 3 months ago
1 MB/s sounds extremely slow -- how many connections/requests per second are you sending to our endpoint? We do have throttling mechanisms that kick in if too many requests are made. Do you see any 429 errors on your end?
Unfortunately, because I am using GCP transfer jobs, I don't know the exact number of concurrent connections. However, I am running 180 concurrent jobs, and they may be sharing IP addresses between them. No 429 errors have been reported.
The throughput has been quite variable and has recovered quite a bit since the time of posting.
Is this dataset hosted through a single server, or is it distributed across nodes in any way?
@mauriceweber can you comment at all on the hosting architecture, or on the most efficient way to initiate file transfers? Are these files hosted on a cloud storage service such as GCS, S3, or CloudFront, or on a single larger machine? My previous transfer jobs were not successful, and I'll need to start a new transfer job this week. Knowing how these files are hosted will help me form a reasonable estimate of how long the jobs should take and inform which transfer method I choose.
Hi @axelmagn, apologies for the late answer! The files are hosted on cloud storage and are publicly accessible only via HTTP. Requests are rate limited, so it is your responsibility to cap the number of requests you make in order not to get rate limited. We are looking into other solutions for large-scale downloads, to make it more convenient to access the full dataset.
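Since the host only says "cap your request rate" without publishing a concrete limit, client-side throttling plus backoff on 429 is the usual defensive pattern. Below is a minimal sketch of that idea: a token-bucket-style limiter and an exponential backoff schedule. The rate of 2 requests/second and the backoff cap of 60 s are illustrative assumptions, not the host's actual limits.

```python
import threading
import time


class RateLimiter:
    """Thread-safe limiter: spaces calls so at most `rate` happen per second.

    A downloader would call acquire() before each HTTP request; the actual
    rate to use here is an assumption you would tune against observed 429s.
    """

    def __init__(self, rate):
        self.interval = 1.0 / rate
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def acquire(self):
        with self.lock:
            now = time.monotonic()
            wait = self.next_slot - now
            # Reserve the next send slot, one interval after the later of
            # "now" and the previously reserved slot.
            self.next_slot = max(now, self.next_slot) + self.interval
        if wait > 0:
            time.sleep(wait)


def backoff_delays(max_retries, base=1.0, cap=60.0):
    """Exponential backoff schedule for 429 responses: 1s, 2s, 4s, ... capped.

    On a 429, sleep through these delays between retries (or honor a
    Retry-After header if the server sends one).
    """
    return [min(cap, base * (2 ** i)) for i in range(max_retries)]


# Illustrative use: throttle to ~2 requests/second per worker.
limiter = RateLimiter(rate=2)
```

A real transfer worker would wrap each GET with `limiter.acquire()` and, on a 429 status, sleep through `backoff_delays(...)` before retrying; with 180 jobs possibly sharing IPs, the effective per-IP rate is what matters, so the per-worker rate may need to be much lower than a single-machine test would suggest.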
No worries and thanks for the reply.
Edit: how many is too many?
I am working on ingesting the RPV2 dataset into GCS buckets using GCP storage transfer jobs. Speeds seem incredibly slow (on the order of 100 KB/s to 1 MB/s), and at this rate it will take on the order of weeks to transfer the files. There's still a possibility that the bottleneck is on my end, but increasingly it looks like the host is either throttling connections or overloaded on I/O.
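The "on the order of weeks" estimate follows from simple arithmetic on throughput. A quick sketch, using a hypothetical 30 TB corpus size (an illustrative figure, not the actual RPV2 footprint) and the per-job speeds observed above:

```python
def transfer_days(total_bytes, bytes_per_sec_per_job, jobs=1):
    """Back-of-envelope wall-clock estimate, assuming jobs run perfectly
    in parallel with no shared bottleneck (optimistic if they share IPs
    or hit a common rate limit)."""
    return total_bytes / (bytes_per_sec_per_job * jobs) / 86400


# Hypothetical 30 TB dataset, 180 concurrent jobs:
fast = transfer_days(30e12, 1e6, jobs=180)   # 1 MB/s per job  -> ~2 days
slow = transfer_days(30e12, 1e5, jobs=180)   # 100 KB/s per job -> ~19 days
```

At the low end of the observed range the estimate lands in the multi-week territory described above, which is why the hosting architecture and any server-side throttling matter so much here.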
Can you shed any light on how this dataset is hosted, or on what the best transfer methods would be at scale? I've already prototyped a small pipeline on sampled data and would like to scale it up in a reasonable timeframe.