terrierteam / pyterrier_colbert


Resume building the index when the process crashes #65

Open merokaa opened 7 months ago

merokaa commented 7 months ago

Hi,

I have recently switched to Pyterrier and have been pleased with the dense retrieval plug-ins. Thank you very much for creating the wonderful framework.

I have been going through the pain of building a ColBERT dense index for MSMARCO passages v2, where the process took a long time and then crashed halfway through due to some technical issues.

I wonder if there is built-in support for resuming the index build; if not, I would appreciate any tips on doing so while preserving the integrity of the index built so far. Thank you.

cmacdonald commented 7 months ago

Hi @merokaa

Thanks for the kind words about PyTerrier etc.; we're glad you like it.

MSMARCO v2 passage is quite large for ColBERT. We once had a kind-of distributed fork of pyterrier_colbert that was a bit more suited to this, but we didn't integrate it.

There is now ColBERT v2, which uses a much more compact index representation and may be better suited to MSMARCO v2. We haven't ported pyterrier_colbert to it yet - but assistance would be appreciated.

Sticking with the current ColBERT/pyterrier_colbert implementation, one other suggestion is simply to split the corpus into chunks and index each chunk separately.

Subject then to having enough memory (or using the mmap option for loading the indices - but that makes things pretty slow), you could combine the retrieval transformers using the + operator:


```python
# indexing, roughly - each ColBERTIndexer call builds a separate index over one slice of the corpus
ColBERTIndexer("index1").index(pt.get_dataset('irds:msmarco-passage').get_corpus_iter()[0:3_000_000])
ColBERTIndexer("index2").index(pt.get_dataset('irds:msmarco-passage').get_corpus_iter()[3_000_000:6_000_000])
# etc.

# retrieval, roughly - one end-to-end ColBERT transformer per index slice
colbert_slice_1 = Factory("index1").end_to_end()
colbert_slice_2 = Factory("index2").end_to_end()
colbert_slice_3 = Factory("index3").end_to_end()

# + combines the scored results from the slices; % 1000 keeps the top 1000 per query
combined = (colbert_slice_1 + colbert_slice_2 + colbert_slice_3) % 1000
```
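
For completeness, a small hedged sketch of how the combined transformer might then be used - `search()` is the standard PyTerrier convenience for running a single query through a transformer, and the query text here is just a placeholder:

```python
# run one query through the combined, sliced ColBERT pipeline
res = combined.search("what is the melting point of iron")
print(res.head())
```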

This works because, I think, IRDS iterators are sliceable.
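
If a given corpus iterator turns out not to be sliceable, a minimal sketch of the same chunking using itertools.islice (keeping the rough ColBERTIndexer call from above; the offsets are just examples):

```python
import itertools

# take documents 3,000,000 .. 5,999,999 from the corpus iterator without relying on slice support
corpus_iter = pt.get_dataset('irds:msmarco-passage').get_corpus_iter()
chunk = itertools.islice(corpus_iter, 3_000_000, 6_000_000)
ColBERTIndexer("index2").index(chunk)
```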

Note that ColBERT indices can also be pruned - see https://github.com/cmacdonald/colbert_static_pruning - we were able to throw away 55% of token embeddings in MSMARCO v1 without significant impact on effectiveness.

@seanmacavaney anything I missed?

merokaa commented 7 months ago

Thank you so much @cmacdonald. That is very useful to know. I am now having some faiss issues when the data starts to be added to the index; it works fine on Colab but not on my machine :(. I'll get back to you once I resolve it.

cmacdonald commented 7 months ago

> I have recently switched to Pyterrier and have been pleased with the dense retrieval plug-ins. Thank you very much for creating the wonderful framework.

Can I tweet this? It's a wonderful quote!

merokaa commented 7 months ago

> > I have recently switched to Pyterrier and have been pleased with the dense retrieval plug-ins. Thank you very much for creating the wonderful framework.
>
> Can I tweet this? It's a wonderful quote!

Sure, please do! :)