stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License
2.89k stars 374 forks source link

Explainability: Retrieve top token-token matches (query -> document) #90

Closed littlewine closed 2 years ago

littlewine commented 2 years ago

Hi, I am trying to shed some light into whats going on under the hood in ColBERT in some specific cases. For this reason, I would like to be able to say "the closest match of query3_token1 is document362_token5". Would something like that be possible?

From what I've seen so far getting this sort of insight is not so trivial (unless I'm missing something). faiss_index.retrieve would return a list of approximate nearest neighbor passage ids. Even if I crop and use only the query token embedding I'm interested in, I would still get a bunch of passage_ids back. Then I would still need to figure out: (a) which document is the "top" document (since my understanding is those are approximate) (b) which document token is the one matching with maxSim.

Regarding (a), I guess I could use the 'rerank' stage of ColBERT, which would calculate "exact" scores for all documents. (b) I see there's an emb2pid mapping created during indexing, but my guess is this contains no information for the (BERT) token_id it represents. So I would have to get the "embedding_id", trace back which document it belongs to and subtract the document start offset to find the token position in the document. Then I would have to tokenize the document, either using the ColBERT tokenizer, or the BERT tokenizer while incrementing for +1 position for the missing [Q] token.

Is there a more easy/straightforward way to do so, either from within ColBERT or faiss itself?

okhat commented 2 years ago

You could look at the top-100 passages from ColBERT's e2e ranking. Then use the encoders on-the-fly in pytorch to quickly find which tokens have the highest similarity. Would that work?

Otherwise you can also keep track of the tokens during indexing and then do this more directly, but it requires more changes to the code.

okhat commented 2 years ago

Let me know if I can help here. Closing for now due to inactivity on this post.