ma787639046 / bowdpr

Codebase for [Paper] Pre-training with Bag-of-Word Prediction for Dense Passage Retrieval
Apache License 2.0

not able to download ms-marco dataset #1

Open seganrasan opened 7 months ago

seganrasan commented 7 months ago

When I run get_data.sh, I get the error below.

./get_data.sh
--2024-02-05 15:14:36-- https://rocketqa.bj.bcebos.com/corpus/marco.tar.gz
Resolving rocketqa.bj.bcebos.com (rocketqa.bj.bcebos.com)... 103.235.46.61, 2409:8c04:1001:1002:0:ff:b001:368a
Connecting to rocketqa.bj.bcebos.com (rocketqa.bj.bcebos.com)|103.235.46.61|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1140510742 (1.1G) [application/x-gzip]
Saving to: ‘marco.tar.gz’

marco.tar.gz 100%[===================================================>] 1.06G 4.27MB/s in 5m 11s

2024-02-05 15:19:50 (3.50 MB/s) - ‘marco.tar.gz’ saved [1140510742/1140510742]

--2024-02-05 15:20:24-- https://msmarco.blob.core.windows.net/msmarcoranking/qidpidtriples.train.full.2.tsv.gz
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 20.150.34.4
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|20.150.34.4|:443... connected.
HTTP request sent, awaiting response... 404 The specified resource does not exist.
2024-02-05 15:20:29 ERROR 404: The specified resource does not exist..

--2024-02-05 15:20:29-- https://msmarco.blob.core.windows.net/msmarcoranking/qrels.train.tsv
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 20.150.34.4
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|20.150.34.4|:443... connected.
HTTP request sent, awaiting response... 404 The specified resource does not exist.
2024-02-05 15:20:29 ERROR 404: The specified resource does not exist..

gzip: qidpidtriples.train.full.2.tsv.gz: No such file or directory

ma787639046 commented 7 months ago

Hi! It seems that the download URL hosted by Microsoft has changed. You can refer to this link for the latest download links.

Now these links are updated to:
https://msmarco.z22.web.core.windows.net/msmarcoranking/qidpidtriples.train.full.2.tsv.gz
https://msmarco.z22.web.core.windows.net/msmarcoranking/qrels.train.tsv

I will change the links in get_data.sh to the latest versions.
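
Until the script is updated, the two failing downloads can be fetched by hand. The following is a minimal sketch, not the actual contents of get_data.sh: the `MSMARCO_BASE` variable and the `msmarco_url` and `fetch` helpers are hypothetical names introduced here for illustration. It assumes `wget` and `gzip` are installed, and it relies on `wget` returning a non-zero exit status on a 404 so that decompression is skipped when a download fails.

```shell
#!/usr/bin/env bash
# Sketch only: MSMARCO_BASE, msmarco_url and fetch are illustrative names,
# not taken from get_data.sh.

# New host reported in this thread; override it from the environment if it moves again.
MSMARCO_BASE="${MSMARCO_BASE:-https://msmarco.z22.web.core.windows.net/msmarcoranking}"

# Build the full URL for a file name under the MS MARCO ranking directory.
msmarco_url() {
  printf '%s/%s\n' "${MSMARCO_BASE%/}" "$1"
}

# Download one file into the current directory.
# Returns non-zero if wget fails (for example on a 404), so callers can stop early.
fetch() {
  local url
  url="$(msmarco_url "$1")"
  wget "$url" || { echo "download failed: $url" >&2; return 1; }
}

# Example usage (commented out so that sourcing this file does not touch the network):
# fetch qidpidtriples.train.full.2.tsv.gz && gzip -d qidpidtriples.train.full.2.tsv.gz
# fetch qrels.train.tsv
```

Chaining the decompression step with `&&` avoids the follow-on `gzip: ... No such file or directory` message seen above, because `gzip` only runs when the download succeeded.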