terrierteam / pyterrier_colbert

dataset = pt.get_dataset("trec-deep-learning-passages"): dataset not found, error 404 #69

Closed: talk2much closed this issue 6 months ago

talk2much commented 6 months ago

While recreating the ColBERT-PRF demo, the call indexer.index(dataset.get_corpus_iter()) fails with a 404 at the index-generation step for the MSMARCO passage-ranking corpus. Changing dataset = pt.get_dataset("trec-deep-learning-passages") to dataset = pt.get_dataset("msmarco_passages") does not help: the passage collection still cannot be found (error 404).

talk2much commented 6 months ago

The specific error is as follows:

The specified resource does not exist. for url: https://msmarco.blob.core.windows.net/msmarcoranking/collection.tar.gz

cmacdonald commented 6 months ago

MSMARCO moved their URLs. Install PyTerrier from GitHub, which has the fixed URL:

pip install --upgrade git+https://github.com/terrier-org/pyterrier.git#egg=python-terrier

ir_datasets has the latest URL too: dataset = pt.get_dataset("irds:msmarco-passage")
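
For reference, a minimal sketch of the full indexing flow with the ir_datasets identifier (the checkpoint and index paths below are placeholders, not values from this thread):

import pyterrier as pt
from pyterrier_colbert.indexing import ColBERTIndexer

pt.init()

# Placeholder values: substitute your own ColBERT checkpoint and index location.
checkpoint = "/path/to/colbert.dnn"
indexer = ColBERTIndexer(checkpoint, "/path/to/index_root", "msmarco_index", chunksize=3)

# The "irds:" prefix resolves the corpus through ir_datasets, which tracks
# the current download URLs for msmarco-passage.
dataset = pt.get_dataset("irds:msmarco-passage")
indexer.index(dataset.get_corpus_iter())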

talk2much commented 6 months ago

Thank you for your reply. The issue with the MSMARCO passage collection has been resolved. However, when I try to index the Robust04 document collection with dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004'), I get an error: the TREC Robust document collection comes from TREC disks 4 and 5, and due to the copyrighted nature of the documents, the collection is for research use only and requires agreements to be filed with NIST. How do I index Robust04? In fact, I already have a Robust04 document set on my machine. Can I use my local files to generate the index? Thank you very much.

cmacdonald commented 6 months ago

You need to symlink the Robust corpus into your ir_datasets (IRDS) folder. IRDS should give an explanation in its error message.

cmacdonald commented 6 months ago

See instructions at https://ir-datasets.com/disks45.html#disks45/nocr/trec-robust-2004
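
For example, a minimal sketch of the symlink step, assuming ir_datasets looks for the disks 4 and 5 source files under ~/.ir_datasets/disks45/corpus (the local path below is hypothetical; the linked page describes the exact layout expected):

import os
from pathlib import Path

local_corpus = Path("/data/trec/disks45")  # hypothetical local copy of disks 4 & 5
irds_dir = Path.home() / ".ir_datasets" / "disks45"
irds_dir.mkdir(parents=True, exist_ok=True)

corpus_link = irds_dir / "corpus"
if not corpus_link.exists():
    # Symlink rather than copy, so ir_datasets reads the files in place.
    os.symlink(local_corpus, corpus_link)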

talk2much commented 6 months ago

Hello. I have now generated most of the needed index files, such as doclens.10.json and docnos.pkl.gz, but the last step, writing the ivfpq.100.faiss file, failed. So I want to reuse the files already produced and only write the ivfpq.100.faiss file.

My code is as follows:

indexer = ColBERTIndexer(checkpoint, "/home/yujy/code/Colbert_PRF/index", "robust04_index",
                         skip_empty_docs=True, chunksize=6, ids=True)
# dataset = pt.get_dataset("trec-deep-learning-passages")
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
indexer.index02(dataset.get_corpus_iter())

def index02(self, iterator):
    docnos = []
    docid = 0

    def convert_gen(iterator):
        import pyterrier as pt
        nonlocal docnos
        nonlocal docid
        if self.num_docs is not None:
            iterator = pt.tqdm(iterator, total=self.num_docs, desc="encoding", unit="d")
        for l in iterator:
            # assign sequential integer docids and remember each docno
            l["docid"] = docid
            docnos.append(l['docno'])
            docid += 1
            yield l

    self.args.generator = convert_gen(iterator)
    index_faiss(self.args)
    print("#> Faiss encoding complete")

But it didn't work; the process got stuck here:

[ 21:30:28] #> Indexing the vectors...
[ 21:30:28] #> Loading ('/home/yujy/code/Colbert_PRF/index/robust04_index/0.pt', '/home/yujy/code/Colbert_PRF/index/robust04_index/1.pt', '/home/yujy/code/Colbert_PRF/index/robust04_index/2.pt') (from queue)...

What should I do?
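
For anyone attempting the same recovery, a minimal sketch of the idea behind index02, assuming (from the log above) that colbert.indexing.faiss.index_faiss reads the already-encoded *.pt shards named by the indexer's args rather than the document generator, so only the FAISS step is re-run:

from colbert.indexing.faiss import index_faiss

# Reuse the args prepared by ColBERTIndexer: they point at the index directory
# that already holds 0.pt, 1.pt, ... and the doclens files, so this re-runs
# only the clustering/writing step that produces ivfpq.100.faiss.
index_faiss(indexer.args)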