Open merokaa opened 7 months ago
Hi @merokaa
Thanks for the kind words about PyTerrier etc., we're glad you like it.
MSMARCO v2 passage is quite large for ColBERT. We once had a sort-of distributed fork of pyterrier_colbert that was a bit more suited to this, but we didn't integrate it.
There is now ColBERT v2, which uses a much more compact index representation, and may be more suited to MSMARCO v2. We haven't ported pyterrier_colbert to that, but assistance would be appreciated.
Sticking with the current ColBERT/pyterrier_colbert implementation, one other suggestion is simply to split the corpus into chunks and index those separately.
Then, subject to having enough memory (or using the mmap option for loading the indices, though that makes things pretty slow), you could combine the retrieval transformers using the + operator:
# indexing, roughly (constructor arguments such as the checkpoint are omitted)
ColBERTIndexer("index1").index(pt.get_dataset('irds:msmarco-passage').get_corpus_iter()[0:3_000_000])
ColBERTIndexer("index2").index(pt.get_dataset('irds:msmarco-passage').get_corpus_iter()[3_000_000:6_000_000])
# etc
# retrieval, roughly (Factory is shorthand for pyterrier_colbert's ColBERTFactory)
colbert_slice_1 = Factory("index1").end_to_end()
colbert_slice_2 = Factory("index2").end_to_end()
colbert_slice_3 = Factory("index3").end_to_end()
# + sums the scores from each slice; % 1000 keeps the top 1000 results per query
combined = (colbert_slice_1 + colbert_slice_2 + colbert_slice_3) % 1000
This works because, I think, IRDS iterators are sliceable.
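If the corpus iterator turns out not to be sliceable, a fallback is to slice it lazily with itertools.islice. Here is a minimal, library-free sketch of the shard-and-merge idea. The names corpus_slice, merge_topk and toy_corpus are hypothetical helpers made up for illustration, not part of PyTerrier or pyterrier_colbert. The merge step is only my reading of what the + and % operators amount to when the shards hold disjoint documents: a union of the per-shard results, then a cutoff.

```python
from itertools import islice
import heapq


def corpus_slice(make_iter, start, stop):
    """Lazily yield documents start..stop-1 from a fresh corpus iterator.

    make_iter is a zero-argument callable returning a new iterator, so every
    shard re-reads the corpus from the beginning and skips the first `start`
    documents. Nothing is materialised in memory.
    """
    return islice(make_iter(), start, stop)


def merge_topk(shard_results, k):
    """Merge per-shard (docno, score) lists and keep the k highest scores.

    Assumes each document lives in exactly one shard and that scores from
    different shards are comparable (same model, same query encoding).
    """
    all_hits = (hit for shard in shard_results for hit in shard)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])


def toy_corpus():
    # Stand-in for a real corpus iterator: dicts with docno and text fields.
    return ({"docno": str(i), "text": f"passage {i}"} for i in range(10))


# Three disjoint shards over the toy corpus: 0-3, 4-7, 8-9.
shards = [list(corpus_slice(toy_corpus, lo, lo + 4)) for lo in (0, 4, 8)]

# Hypothetical per-shard retrieval results for a single query.
results = [[("1", 0.7), ("3", 0.2)], [("5", 0.9)], [("8", 0.4)]]
top2 = merge_topk(results, k=2)
```

With real data, each corpus_slice(...) would be handed to a separate indexer in place of the [start:stop] slices above. Because every shard re-reads the corpus from the start, later shards take longer to begin indexing.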
Note that ColBERT indices can also be pruned - see https://github.com/cmacdonald/colbert_static_pruning - we were able to throw away 55% of token embeddings in MSMARCO v1 without significant impact on effectiveness.
@seanmacavaney anything I missed?
Thank you so much @cmacdonald. That is very useful to know. I am now having some faiss issues when the data starts to be added to the index; it works fine on Colab but not on my machine :(. I'll get back to you once I resolve it.
I have recently switched to Pyterrier and have been pleased with the dense retrieval plug-ins. Thank you very much for creating the wonderful framework.
Can I tweet this? It's a wonderful quote!
Sure, please do! :)
Hi,
I have recently switched to Pyterrier and have been pleased with the dense retrieval plug-ins. Thank you very much for creating the wonderful framework.
I have been going through the pain of building a ColBERT dense index for MSMARCO passages v2: the process took a long time and crashed halfway through due to some technical issues.
I wonder if there is built-in support for resuming an index build; if not, I would appreciate any tips on doing that while keeping the integrity of the index built so far. Thank you.