sebastian-hofstaetter / matchmaker

Training & evaluation library for text-based neural re-ranking and dense retrieval models built with PyTorch
https://neural-ir-explorer.ec.tuwien.ac.at/
Apache License 2.0

How to integrate pre-trained ColBERT as retriever into retriever-reranker IR system #20

Open stefanik12 opened 2 years ago

stefanik12 commented 2 years ago

Hi! tl;dr: If it's a good idea, which matchmaker modules should I use to index and rank documents using a ColBERT-like retriever with a token-level vector index?

We are trying to evaluate the benefits of different retrievers in the retriever-reranker framework. We found that neural dense retrievers bring big qualitative benefits on our data set (100k questions + approx. 1 million answers from MathStackExchange). I'd like to evaluate the potential benefits of token-level neural retrieval as well, but I struggle to fit the contextual token-level index into memory. It seems, though, that this is something matchmaker can deal with.
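
For a sense of scale, here is a rough back-of-the-envelope estimate of why the token-level index is so much larger than a single-vector one. Only the document count (about 1 million answers) comes from our data set; the tokens per document, the embedding dimensions, and the fp16 storage are assumptions picked for illustration, not measured values.

```python
def index_size_gib(num_docs: int, vectors_per_doc: int, dim: int, bytes_per_value: int) -> float:
    """Raw size of an uncompressed vector index in GiB (no index overhead)."""
    return num_docs * vectors_per_doc * dim * bytes_per_value / 2**30


NUM_DOCS = 1_000_000            # approx. number of answers in our collection
ASSUMED_TOKENS_PER_DOC = 200    # assumption: average number of stored token vectors
ASSUMED_TOKEN_DIM = 128         # assumption: ColBERT-style reduced token dimension
ASSUMED_DOC_DIM = 768           # assumption: BERT-base sized single document vector
BYTES_FP16 = 2                  # assumption: vectors stored in half precision

single_vector_gib = index_size_gib(NUM_DOCS, 1, ASSUMED_DOC_DIM, BYTES_FP16)
token_level_gib = index_size_gib(NUM_DOCS, ASSUMED_TOKENS_PER_DOC, ASSUMED_TOKEN_DIM, BYTES_FP16)

print(f"single-vector index: {single_vector_gib:.1f} GiB")
print(f"token-level index:   {token_level_gib:.1f} GiB")
print(f"ratio:               {token_level_gib / single_vector_gib:.1f}x")
```

Under these assumptions the token-level index is more than an order of magnitude larger, which matches the memory problem described above.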

I've looked into the dense_retrieval_evaluate README, but I presume that the CLI alone will not suffice if I also want to proceed with a reranker.

That led me to an attempt to reuse dense_retrieval.py, but with my limited understanding of the code, I believe that this script still expects the model to deliver a single dense representation (vector) per document.
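
To illustrate the difference I mean, here is a minimal sketch in plain Python contrasting single-vector scoring with ColBERT-style late interaction (MaxSim), where each query token is matched against its best document token and the maxima are summed. The function names and toy vectors are made up for this example and are not taken from matchmaker.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def single_vector_score(query_vec: Vector, doc_vec: Vector) -> float:
    """Dense retrieval score: one vector per query and one per document."""
    return dot(query_vec, doc_vec)


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """ColBERT-style score: sum over query tokens of the best-matching document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-dimensional token embeddings (hypothetical values).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three token vectors
doc_b_tokens = [[0.5, 0.5]]                          # one token vector

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)
print(score_a, score_b)
```

The point is that the index has to return a variable number of vectors per document, so a script that assumes exactly one vector per document cannot be used unchanged.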

I am wondering whether dense_retrieval.py is a good place to start, or whether there are easier ways of integrating (and possibly permuting) different neural retrievers in the retriever-reranker IR system.
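
What I have in mind is roughly the following kind of setup, where the retriever and the reranker are interchangeable callables. This is only a toy sketch: the `Retriever` and `Reranker` types, the term-overlap scoring, and the tiny document collection are hypothetical stand-ins, not matchmaker's API or our actual models.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces: any retriever or reranker with these signatures can be plugged in.
Retriever = Callable[[str, int], List[str]]                      # (query, k) -> doc ids
Reranker = Callable[[str, List[str]], List[Tuple[str, float]]]   # (query, ids) -> scored ids

DOCS: Dict[str, str] = {
    "d1": "integral of x squared",
    "d2": "derivative of sine",
    "d3": "integral of sine x",
}


def overlap(query: str, text: str) -> int:
    """Number of distinct terms shared by query and text."""
    return len(set(query.split()) & set(text.split()))


def toy_retriever(query: str, k: int) -> List[str]:
    """Stand-in for a neural retriever: rank all documents by term overlap."""
    ranked = sorted(DOCS, key=lambda doc_id: overlap(query, DOCS[doc_id]), reverse=True)
    return ranked[:k]


def toy_reranker(query: str, doc_ids: List[str]) -> List[Tuple[str, float]]:
    """Stand-in for a neural reranker: term overlap normalized by document length."""
    scored = [(d, overlap(query, DOCS[d]) / len(DOCS[d].split())) for d in doc_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def retrieve_and_rerank(query: str, retriever: Retriever, reranker: Reranker,
                        k: int = 2, top_n: int = 1) -> List[Tuple[str, float]]:
    """First stage fetches k candidates; second stage rescores them and keeps top_n."""
    candidates = retriever(query, k)
    return reranker(query, candidates)[:top_n]


result = retrieve_and_rerank("integral of sine", toy_retriever, toy_reranker)
print(result)
```

With such an interface, comparing retrievers would only mean passing a different `retriever` callable while keeping the reranker fixed.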

If it helps anyone, our evaluation framework can be found in this notebook.