microsoft / msmarco

website for MS Marco
https://microsoft.github.io/msmarco/.
Creative Commons Attribution 4.0 International

Downloading msmarco_v2_doc.tar #7

Closed: seanmacavaney closed this issue 3 years ago

seanmacavaney commented 3 years ago

We're really excited that the v2 document corpus is now available! A couple of questions:

sebastian-hofstaetter commented 3 years ago

Here in Vienna, Austria, we have the same problem: about 4 MB/s download speed to a university server, and the download is repeatedly interrupted and restarts from the beginning (using wget). So far we haven't been able to download the file.

bmitra-msft commented 3 years ago

For downloading the larger files, we recommend using AzCopy:
https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10
https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-blobs-download

Can you please let us know if that resolves the issue?

seanmacavaney commented 3 years ago

azcopy worked like a charm this time, thanks! Maybe the file just needed time to replicate, because when I first tried azcopy the other week, it was just as slow as curl.

sebastian-hofstaetter commented 3 years ago

Thanks Bhaskar! Can also confirm azcopy worked in 5 minutes for the 30GB :) Command:

azcopy copy https://msmarco.blob.core.windows.net/msmarcoranking/msmarco_v2_doc.tar msmarco_v2_doc.tar

seanmacavaney commented 3 years ago

tl;dr: if you do not want to use azcopy, you can improve the reliability of the download by setting the X-Ms-Version: 2019-12-12 header. Example:

wget --header "X-Ms-Version: 2019-12-12" https://msmarco.blob.core.windows.net/msmarcoranking/msmarco_v2_doc.tar

Details:

I dug into this a bit, and here's what I found. azcopy issues multiple HTTP requests for different parts of the file at once, which is why it can end up being faster than a single request. To do this, it sets a variety of HTTP request headers, but the most important one seems to be X-Ms-Version: 2019-12-12. When this header is present, the server accepts HTTP range requests, meaning a download can resume from where it left off if it is interrupted.
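For anyone who would rather script the resume logic than rely on wget, here is a rough Python sketch of the idea. It is only an illustration under assumptions: the helper names resume_headers and download_with_resume are made up for this example, it assumes the server answers a ranged request with 206 Partial Content when the X-Ms-Version header is set (as described above), and I have not run it against the full 30GB file.

```python
import os
import urllib.error
import urllib.request

# Version string taken from the header discussed above.
API_VERSION = "2019-12-12"


def resume_headers(bytes_on_disk, api_version=API_VERSION):
    """Build request headers that ask the server to continue at a byte offset."""
    headers = {"X-Ms-Version": api_version}
    if bytes_on_disk > 0:
        # Open-ended range: everything from the first missing byte onward.
        headers["Range"] = f"bytes={bytes_on_disk}-"
    return headers


def download_with_resume(url, dest, chunk_size=1 << 20,
                         opener=urllib.request.urlopen):
    """Download url to dest, appending to a partial file if one exists.

    Hypothetical helper: `opener` is injectable so the logic can be
    exercised without touching the network. Returns the final file size.
    """
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    request = urllib.request.Request(url, headers=resume_headers(offset))
    try:
        with opener(request) as response:
            # 206 means the range was honoured, so append. Anything else
            # (typically 200) means the whole file is being sent again,
            # so the partial file is overwritten.
            mode = "ab" if offset > 0 and response.status == 206 else "wb"
            with open(dest, mode) as out:
                while True:
                    chunk = response.read(chunk_size)
                    if not chunk:
                        break
                    out.write(chunk)
    except urllib.error.HTTPError as err:
        # 416 (Range Not Satisfiable) usually means nothing is left to fetch.
        if err.code == 416:
            return offset
        raise
    return os.path.getsize(dest)
```

Calling download_with_resume("https://msmarco.blob.core.windows.net/msmarcoranking/msmarco_v2_doc.tar", "msmarco_v2_doc.tar") again after an interruption should pick up from the bytes already on disk. Wrapping the call in a retry loop is left out to keep the sketch short.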