Advanced PyTerrier bindings for ColBERT, including for dense indexing and retrieval. This also includes the implementations of ColBERT PRF, approximate ANN scoring and query embedding pruning.
Given an existing ColBERT checkpoint, an end-to-end ColBERT dense retrieval index can be created as follows:
from pyterrier_colbert.indexing import ColBERTIndexer
indexer = ColBERTIndexer("/path/to/checkpoint.dnn", "/path/to/index", "index_name")
indexer.index(dataset.get_corpus_iter())
An end-to-end ColBERT dense retrieval pipeline can be formulated as follows:
from pyterrier_colbert.ranking import ColBERTFactory
pytcolbert = ColBERTFactory("/path/to/checkpoint.dnn", "/path/to/index", "index_name")
dense_e2e = pytcolbert.end_to_end()
A ColBERT re-ranker of BM25 can be formulated as follows (you will need an index with the text saved; the Terrier Data Repository conveniently provides such an index):
import pyterrier as pt
bm25 = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed_text', wmodel='BM25', metadata=['docno', 'text'])
sparse_colbert = bm25 >> pytcolbert.text_scorer()
Thereafter it is possible to conduct a side-by-side comparison of effectiveness:
pt.Experiment(
    [bm25, sparse_colbert, dense_e2e],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "ndcg_cut_10"],
    names=["BM25", "BM25 >> ColBERT", "Dense ColBERT"]
)
You can use ColBERTFactory to obtain ColBERT PRF pipelines, as follows:
colbert_prf_rank = pytcolbert.prf(rerank=False)
colbert_prf_rerank = pytcolbert.prf(rerank=True)
ColBERT PRF requires the ColBERT index to have aligned token ids. During indexing, pass the ids=True kwarg to ColBERTIndexer, as follows:
indexer = ColBERTIndexer("/path/to/checkpoint.dnn", "/path/to/index", "index_name", ids=True)
If you use ColBERT PRF in your research, you must cite our ICTIR 2021 paper (citation included below).
All of our results files are available from the paper's Virtual Appendix.
This repository contains code to apply the techniques of query embedding pruning [Tonellotto21] and approximate ANN ranking [Macdonald21a].
Query embedding pruning can be applied using the following pipeline:
import pyterrier_colbert.pruning
# factory is a ColBERTFactory instance, such as pytcolbert above
qep_pipe5 = (factory.query_encoder()
    >> pyterrier_colbert.pruning.query_embedding_pruning(factory, 5)
    >> factory.set_retrieve(query_encoded=True)
    >> factory.index_scorer(query_encoded=False)
)
where 5 is the number of query embeddings to retain, selected based on collection frequency.
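To make the selection step concrete, here is a minimal pure-Python sketch of pruning a query down to a fixed number of tokens by collection frequency. It is not the package's implementation: the function name prune_query_tokens and the toy frequencies are hypothetical, and it assumes that the rarest tokens (lowest collection frequency) are the ones retained. See [Tonellotto21] for the actual criterion.

```python
from typing import Dict, List, Tuple


def prune_query_tokens(
    tokens: List[str],
    collection_freq: Dict[str, int],
    k: int,
) -> List[Tuple[int, str]]:
    """Keep the k query tokens with the lowest collection frequency.

    ASSUMPTION: rarer tokens are treated as more informative and retained.
    Returns (position, token) pairs in the original query order.
    """
    # Rank token positions by ascending collection frequency.
    # Tokens missing from the table count as frequency 0; ties keep query order.
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: (collection_freq.get(tokens[i], 0), i),
    )
    # Restore the original query order for the k retained positions.
    kept_positions = sorted(ranked[:k])
    return [(i, tokens[i]) for i in kept_positions]


# Hypothetical collection frequencies, for illustration only.
toy_cf = {
    "what": 900_000,
    "is": 2_500_000,
    "the": 5_000_000,
    "capital": 40_000,
    "of": 4_000_000,
    "france": 25_000,
}
query = ["what", "is", "the", "capital", "of", "france"]
kept = prune_query_tokens(query, toy_cf, 3)
print(kept)
```

In the real pipeline the retained items are query embeddings rather than token strings; the sketch only illustrates the ranking and cut-off.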
Approximate ANN scoring can be applied using the following pipeline:
ann_pipe = (factory.ann_retrieve_score() % 200) >> factory.index_scorer(query_encoded=True)
where 200 is the number of top-scored ANN candidates to forward for exact scoring.
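The two-stage idea can be sketched in plain Python: cut the candidate list to the top k by an approximate score, then rescore the survivors exactly with ColBERT's late-interaction (MaxSim) sum. This is a toy illustration under stated assumptions, not the package's code: the helper names, the dictionary-based index and the approximate scores are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_embs: List[Vector], doc_embs: List[Vector]) -> float:
    """Late-interaction score: for each query embedding, take the
    best-matching document embedding, then sum over query embeddings."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


def rerank_top_candidates(
    query_embs: List[Vector],
    approx_scores: Dict[str, float],
    doc_index: Dict[str, List[Vector]],
    k: int,
) -> List[Tuple[str, float]]:
    # Stage 1: keep the k documents with the highest approximate score
    # (the role played by the `% 200` rank cutoff in the pipeline).
    candidates = sorted(approx_scores, key=approx_scores.get, reverse=True)[:k]
    # Stage 2: rescore only those candidates exactly.
    rescored = [(docno, maxsim(query_embs, doc_index[docno])) for docno in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional embeddings and hypothetical approximate scores.
query_embs = [[1.0, 0.0], [0.0, 1.0]]
doc_index = {
    "d1": [[1.0, 0.0], [0.0, 1.0]],
    "d2": [[1.0, 0.0], [1.0, 0.0]],
    "d3": [[0.0, 0.5]],
}
approx_scores = {"d1": 0.7, "d2": 0.9, "d3": 0.1}

ranking = rerank_top_candidates(query_embs, approx_scores, doc_index, k=2)
print(ranking)
```

Here d2 leads on the approximate score, but exact scoring places d1 first, while d3 is never rescored because it falls outside the cutoff.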
You will need a GPU to use this, preferably more than one. You will also need lots of RAM - ColBERT requires you to load the entire index into memory.
Name | Corpus size | Indexing Time | Index Size |
---|---|---|---|
Vaswani | 11k abstracts | 2 minutes (1 GPU) | 163 MB |
MSMARCO Passages | 8M passages | ~24 hours (1 GPU) | 192 GB |
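As a rough guide for sizing other corpora, the figures in the table can be turned into a per-document index footprint. The snippet below is simple arithmetic on the table values, assuming decimal units (1 MB = 10^6 bytes) and treating the rounded corpus sizes as exact, so the results are approximate.

```python
def bytes_per_document(index_bytes: float, num_docs: int) -> float:
    """Average index footprint per document, in bytes."""
    return index_bytes / num_docs


MB = 10**6  # decimal units assumed
GB = 10**9

# Values taken from the table above (corpus sizes are rounded).
vaswani = bytes_per_document(163 * MB, 11_000)
msmarco = bytes_per_document(192 * GB, 8_000_000)

print(f"Vaswani: ~{vaswani / 1000:.1f} KB per abstract")
print(f"MSMARCO Passages: ~{msmarco / 1000:.1f} KB per passage")
```

This works out to roughly 15 KB per Vaswani abstract and 24 KB per MSMARCO passage. Since the whole index is loaded into memory, multiplying a per-document figure by your corpus size gives a first estimate of the RAM required.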
This package can be installed using pip, and then used with PyTerrier. See also the example notebooks.
pip install -q git+https://github.com/terrierteam/pyterrier_colbert.git
conda install -c pytorch faiss-gpu=1.6.5 # or faiss-cpu
#on Colab: pip install faiss-gpu==1.6.5
NB: ColBERT requires FAISS, namely the faiss-gpu package, to be installed. pip install faiss-gpu does NOT usually work.
FAISS recommends using Anaconda to install faiss-gpu.
On Colab, you need to resort to pip install. We recommend faiss-gpu version 1.6.3, not 1.7.0.
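After installing, a quick check like the following can confirm that FAISS is importable from Python. It is a hedged sketch: the faiss_status helper is hypothetical, and it assumes that a missing get_num_gpus attribute indicates a CPU-only build.

```python
import importlib.util


def faiss_status() -> str:
    """Report whether FAISS is importable and, if possible, how many GPUs it sees."""
    if importlib.util.find_spec("faiss") is None:
        return "faiss is not installed"
    import faiss

    # ASSUMPTION: get_num_gpus may be absent on CPU-only builds, so guard the lookup.
    get_num_gpus = getattr(faiss, "get_num_gpus", None)
    if get_num_gpus is None:
        return "faiss is installed (no GPU support detected)"
    return f"faiss is installed, {get_num_gpus()} GPU(s) visible"


print(faiss_status())
```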