The specific error is as follows: "The specified resource does not exist." for URL https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz
MSMARCO moved their URLs.
Install PyTerrier from GitHub, which has the updated URL:
pip install --upgrade git+https://github.com/terrier-org/pyterrier.git#egg=python-terrier
ir_datasets has the latest URL too:
dataset = pt.get_dataset("irds:msmarco-passage")
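For example, once the dataset loads, a minimal indexing sketch (the index path is illustrative; msmarco-passage documents expose a text field):
import pyterrier as pt
pt.init()
dataset = pt.get_dataset("irds:msmarco-passage")
# Index the passage text into a standard Terrier index (path is illustrative)
indexer = pt.IterDictIndexer("./msmarco_passage_index")
indexref = indexer.index(dataset.get_corpus_iter(), fields=["text"])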
Thank you for your reply. The issue with the MSMARCO document set has been resolved. However, when I try to index the Robust04 document set with dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004'), I get an error: "The TREC Robust document collection is from TREC disks 4 and 5. Due to the copyrighted nature of the documents, this collection is for research use only, which requires agreements to be filed with NIST." How do I index Robust04? In fact, I already have a Robust04 document set on my device. Can I use my local files to generate the index? Thank you very much.
You need to symlink the Robust corpus into your ir_datasets folder. ir_datasets should give an explanation in its error message.
See instructions at https://ir-datasets.com/disks45.html#disks45/nocr/trec-robust-2004
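As a rough sketch, assuming the default ir_datasets home at ~/.ir_datasets (the error message prints the exact path and layout it expects), you could link your local copy like this, where /path/to/local/disks45 is a placeholder for your own copy:
import os

irds_home = os.path.expanduser("~/.ir_datasets")  # default ir_datasets data directory
target = os.path.join(irds_home, "disks45", "corpus")
os.makedirs(os.path.dirname(target), exist_ok=True)
# Point ir_datasets at the local copy of TREC disks 4 & 5 (path is illustrative)
os.symlink("/path/to/local/disks45", target)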
Hello. I have now generated most of the required index files, such as doclens.10.json and docnos.pkl.gz, but the last step, writing the ivfpq.100.faiss file, failed. So I want to use the files I have already obtained to write the ivfpq.100.faiss file.
My code is as follows:
indexer = ColBERTIndexer(checkpoint, "/home/yujy/code/Colbert_PRF/index", "robust04_index", skip_empty_docs=True, chunksize=6, ids=True)
# dataset = pt.get_dataset("trec-deep-learning-passages")
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
indexer.index02(dataset.get_corpus_iter())
def index02(self, iterator):
    docnos = []
    docid = 0
    def convert_gen(iterator):
        import pyterrier as pt
        nonlocal docnos
        nonlocal docid
        # Wrap the iterator in a progress bar when the corpus size is known
        if self.num_docs is not None:
            iterator = pt.tqdm(iterator, total=self.num_docs, desc="encoding", unit="d")
        # Assign sequential integer docids and remember each docno
        for l in iterator:
            l["docid"] = docid
            docnos.append(l['docno'])
            docid += 1
            yield l
    self.args.generator = convert_gen(iterator)
    index_faiss(self.args)
    print("#> Faiss encoding complete")
But it didn't work; it got stuck here:
[ 21:30:28] #> Indexing the vectors...
[ 21:30:28] #> Loading ('/home/yujy/code/Colbert_PRF/index/robust04_index/0.pt', '/home/yujy/code/Colbert_PRF/index/robust04_index/1.pt', '/home/yujy/code/Colbert_PRF/index/robust04_index/2.pt') (from queue)...
What should I do?
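If the encoded shards (0.pt, 1.pt, 2.pt) are already complete on disk, one possible workaround is to build the ivfpq.100.faiss file directly with FAISS instead of going through index_faiss. This is only a minimal sketch, assuming each shard is a tensor of shape (num_embeddings, 128), with 128 being ColBERT's default embedding dimension, that the 100 in the filename is the partition count, and that 16 sub-quantizers with 8 bits each are acceptable PQ parameters:
import faiss
import torch

shard_paths = [
    "/home/yujy/code/Colbert_PRF/index/robust04_index/0.pt",
    "/home/yujy/code/Colbert_PRF/index/robust04_index/1.pt",
    "/home/yujy/code/Colbert_PRF/index/robust04_index/2.pt",
]

dim = 128          # ColBERT's default embedding dimension (assumption)
partitions = 100   # matches the "100" in ivfpq.100.faiss (assumption)
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFPQ(quantizer, dim, partitions, 16, 8)  # 16 sub-quantizers, 8 bits each

# Train on the first shard, then add all embeddings shard by shard to limit memory use
train_sample = torch.load(shard_paths[0]).float().numpy()
index.train(train_sample)
for path in shard_paths:
    embeddings = torch.load(path).float().numpy()
    index.add(embeddings)

faiss.write_index(index, "/home/yujy/code/Colbert_PRF/index/robust04_index/ivfpq.100.faiss")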
When I was recreating the ColBERT PRF demo, indexer.index(dataset.get_corpus_iter()) raised a 404 Not Found error in the index-generation step for the MSMARCO passage ranking corpus. Changing dataset = pt.get_dataset("trec-deep-learning-passages") to dataset = pt.get_dataset("msmarco_passages") did not help: the passages collection still cannot be found and I still get the 404 error.