terrierteam / pyterrier_colbert

82 stars 35 forks

Indexing fails while writing ivfpq.100.faiss #70

Open talk2much opened 4 months ago

talk2much commented 4 months ago

Hello. I have now generated most of the required index files, such as doclens.10.json and docnos.pkl.gz, but the last step, writing the ivfpq.100.faiss file, failed. So I want to use the files I already have to write the ivfpq.100.faiss file. My code is as follows:

indexer = ColBERTIndexer(checkpoint, "/home/yujy/code/Colbert_PRF/index", "robust04_index",
                         skip_empty_docs=True, chunksize=6, ids=True)
# dataset = pt.get_dataset("trec-deep-learning-passages")
dataset = pt.get_dataset('irds:disks45/nocr/trec-robust-2004')
indexer.index02(dataset.get_corpus_iter())

where index02 is defined on the indexer as:

    def index02(self, iterator):
        docnos = []
        docid = 0

        def convert_gen(iterator):
            import pyterrier as pt
            nonlocal docnos
            nonlocal docid
            if self.num_docs is not None:
                iterator = pt.tqdm(iterator, total=self.num_docs, desc="encoding", unit="d")
            for l in iterator:
                l["docid"] = docid
                docnos.append(l['docno'])
                docid += 1
                yield l

        self.args.generator = convert_gen(iterator)
        index_faiss(self.args)
        print("#> Faiss encoding complete")

But it didn't work out; it got stuck here:

[ 21:30:28] #> Indexing the vectors...
[ 21:30:28] #> Loading ('/home/yujy/code/Colbert_PRF/index/robust04_index/0.pt', '/home/yujy/code/Colbert_PRF/index/robust04_index/1.pt', '/home/yujy/code/Colbert_PRF/index/robust04_index/2.pt') (from queue)...
talk2much commented 4 months ago

I have all the files I need for the index except ivfpq.100.faiss.

Xiao0728 commented 4 months ago

Hi, it seems the generated files don't include the ivfpq.faiss file either, which means the indexing process didn't succeed. Normally, to create the Robust04 ColBERT & ColBERT-PRF indices, we split the long Robust04 documents into smaller passages. Maybe you can try the following code:

import pyterrier as pt
from pyterrier_colbert.indexing import ColBERTIndexer

index_root = "/some/path"
index_name = "index_name"

# split each document into passages: default window 150 tokens, stride 75
indexer = pt.text.sliding(text_attr="text", prepend_title=False) >> ColBERTIndexer("/path/to/colbert.dnn", index_root, index_name, chunksize=20)
indexer.index(pt.get_dataset("irds:trec-robust04").get_corpus_iter())
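For reference, pt.text.sliding splits each document's text into overlapping token windows. A rough pure-Python sketch of the idea (window and stride mirror the defaults mentioned above; this is not PyTerrier's actual implementation):

```python
def sliding_passages(text, window=150, stride=75):
    """Split whitespace-tokenised text into overlapping passages of up to
    `window` tokens, advancing the start position by `stride` tokens."""
    tokens = text.split()
    passages = []
    for start in range(0, len(tokens), stride):
        passages.append(" ".join(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # this window already covers the end of the document
    return passages
```

Splitting matters for Robust04 because its news documents are far longer than the passages ColBERT was trained on, and very long inputs get truncated by the encoder.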
talk2much commented 4 months ago

> Hi, it seems the generated files don't include the ivfpq.faiss file either, which means the indexing process didn't succeed. Normally, to create the Robust04 ColBERT & ColBERT-PRF indices, we split the long Robust04 documents into smaller passages. Maybe you can try the following code:
>
> from pyterrier_colbert.indexing import ColBERTIndexer
> index_root = "/some/path"
> index_name = "index_name"
>
> # default 150, stride 75
> indexer = pt.text.sliding(text_attr="text", prepend_title=False) >> ColBERTIndexer("/path/to/colbert.dnn", index_root, index_name, chunksize=20)
> indexer.index(pt.get_dataset("irds:trec-robust04").get_corpus_iter())

I get an error when I run the code unchanged:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/site-packages/colbert/indexing/faiss.py", line 92, in _loader_thread
    sub_collection = [load_index_part(filename) for filename in filenames if filename is not None]
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/site-packages/colbert/indexing/faiss.py", line 92, in <listcomp>
    sub_collection = [load_index_part(filename) for filename in filenames if filename is not None]
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/site-packages/colbert/indexing/index_manager.py", line 17, in load_index_part
    part = torch.load(filename)
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/site-packages/torch/serialization.py", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/home/yujy/anaconda3/envs/colbert/lib/python3.8/site-packages/torch/serialization.py", line 457, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: not a ZIP archive

It looks like the thread won't start, right?
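One way to narrow this down: files written by torch.save (in its default format) are ZIP archives, so a .pt part that triggers "not a ZIP archive" is usually truncated or corrupt, typically from an earlier run that was interrupted mid-write. A small stdlib-only sketch to flag suspect parts (the directory path is illustrative):

```python
import glob
import os
import zipfile

def find_corrupt_parts(index_dir):
    """Return the .pt files in index_dir that are not valid ZIP archives
    (torch.save's default container format), i.e. likely truncated/corrupt."""
    suspect = []
    for path in sorted(glob.glob(os.path.join(index_dir, "*.pt"))):
        if not zipfile.is_zipfile(path):
            suspect.append(path)
    return suspect

# e.g. find_corrupt_parts("/home/yujy/code/Colbert_PRF/index/robust04_index")
```

If any parts are flagged, deleting them and re-running the encoding step (rather than only the faiss step) is the safer fix.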

talk2much commented 4 months ago

The location of this error is

[Apr 25, 10:52:31] #> Indexing the vectors...
[Apr 25, 10:52:31] #> Loading ('/home/yujy/code/Colbert_PRF/index/robust04_index/0.pt', '/home/yujy/code/Colbert_PRF/index/robust04_index/1.pt', '/home/yujy/code/Colbert_PRF/index/robust04_index/2.pt') (from queue)...
Exception in thread Thread-2:
Traceback (most recent call last):
talk2much commented 4 months ago

> Hi, it seems the generated files don't include the ivfpq.faiss file either, which means the indexing process didn't succeed. Normally, to create the Robust04 ColBERT & ColBERT-PRF indices, we split the long Robust04 documents into smaller passages. Maybe you can try the following code:
>
> from pyterrier_colbert.indexing import ColBERTIndexer
> index_root = "/some/path"
> index_name = "index_name"
>
> # default 150, stride 75
> indexer = pt.text.sliding(text_attr="text", prepend_title=False) >> ColBERTIndexer("/path/to/colbert.dnn", index_root, index_name, chunksize=20)
> indexer.index(pt.get_dataset("irds:trec-robust04").get_corpus_iter())

After debugging, I found that the cause of the error is that none of my .pt files can be loaded by torch, so it raises: RuntimeError: PytorchStreamReader failed reading zip archive: not a ZIP archive
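This failure mode can be reproduced by truncating a file written by torch.save, which supports the theory that the .pt parts were only partially written (for example, the encoding run was killed or the disk filled up). A sketch, assuming torch is installed (the file name is illustrative):

```python
import os
import tempfile

import torch

path = os.path.join(tempfile.mkdtemp(), "part.pt")
torch.save(torch.zeros(4), path)  # default format: a ZIP archive

# Simulate an interrupted write by keeping only the first few bytes.
with open(path, "r+b") as f:
    f.truncate(8)

try:
    torch.load(path)
    error = None
except Exception as e:  # torch raises RuntimeError for a broken archive
    error = str(e)
```

If this matches your situation, the usual remedy is to delete the damaged parts and re-run the document encoding, not just the faiss step.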

cmacdonald commented 4 months ago

And what does Google say when you search for this error message?