seanmacavaney closed 3 years ago
Here in Vienna, Austria we have the same problem: ~4MB/s download speed to a university server and the download gets repeatedly interrupted and restarts from the beginning (using wget). So far we haven't been able to download the file.
For downloading the larger files, we recommend using AzCopy. https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10 https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-blobs-download
Can you please let us know if that resolves the issue?
azcopy worked like a charm this time, thanks! Maybe the file just needed time to replicate, because when I first tried azcopy the other week, it was just as slow as curl.
Thanks Bhaskar! Can also confirm azcopy worked in 5 minutes for the 30GB :) Command:
azcopy copy https://msmarco.blob.core.windows.net/msmarcoranking/msmarco_v2_doc.tar msmarco_v2_doc.tar
tl;dr: if you do not want to use azcopy, you can improve the reliability of the download by setting the X-Ms-Version: 2019-12-12 header. Example:
wget --header "X-Ms-Version: 2019-12-12" https://msmarco.blob.core.windows.net/msmarcoranking/msmarco_v2_doc.tar
Details:
I dug a bit into this, and here's what I've found: azcopy issues multiple HTTP requests for different parts of the file at once, which is why it can end up being faster than issuing a single request. To facilitate this, it sets a variety of HTTP request headers, but the most important one seems to be X-Ms-Version: 2019-12-12. When this header is present, the server will accept HTTP range requests, meaning downloads will be able to resume from where they left off if they are interrupted.
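To make the resume behavior concrete, here is a rough Python sketch of the same idea using only the standard library. It is not how azcopy or wget implement it; the function names (build_resume_request, resume_download) and the chunk size are made up for illustration, and it assumes the server answers a ranged request with 206 Partial Content when the X-Ms-Version header is present, as described above.

```python
import os
import urllib.request

# Header reported above as the one that makes the server honor range requests.
API_VERSION_HEADER = {"X-Ms-Version": "2019-12-12"}


def build_resume_request(url, dest_path):
    """Build a GET request that continues after the bytes already on disk.

    Hypothetical helper: returns (request, offset), where offset is the
    size of the partial file, or 0 if nothing has been downloaded yet.
    """
    offset = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    headers = dict(API_VERSION_HEADER)
    if offset > 0:
        # Ask only for the bytes from `offset` to the end of the file.
        headers["Range"] = f"bytes={offset}-"
    return urllib.request.Request(url, headers=headers), offset


def resume_download(url, dest_path, chunk_size=1 << 20):
    """Download url to dest_path, appending to a partial file if possible."""
    request, offset = build_resume_request(url, dest_path)
    with urllib.request.urlopen(request) as response:
        # 206 means the range was honored, so append; any other success
        # status means the full body is coming, so start the file over.
        mode = "ab" if offset > 0 and response.status == 206 else "wb"
        with open(dest_path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

Calling resume_download(url, "msmarco_v2_doc.tar") again after an interruption should pick up where the previous attempt stopped. The sketch does not handle the case where the file is already complete (the server would typically reject the range), and it does not verify the result, so checking the file size or a checksum afterwards is still a good idea.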
We're really excited that the v2 document corpus is now available! A couple of questions:
azcopy didn't make a difference.