Closed littlewine closed 2 years ago
You could look at the top-100 passages from ColBERT's e2e ranking. Then use the encoders on the fly in PyTorch to quickly find which tokens have the highest similarity. Would that work?
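That on-the-fly matching could be sketched roughly as follows — a minimal sketch assuming you already have L2-normalized per-token embeddings (shape `[num_tokens, dim]`) for one query and one candidate passage from the encoders; the function name and shapes are illustrative, not ColBERT's actual API:

```python
import torch

def best_token_matches(query_emb: torch.Tensor, doc_emb: torch.Tensor):
    # [num_q_tokens, num_d_tokens] cosine-similarity matrix
    # (dot product suffices because the embeddings are unit-normalized)
    sim = query_emb @ doc_emb.T
    # maxSim per query token, plus the index of the passage token realizing it
    max_sim, argmax = sim.max(dim=1)
    return max_sim, argmax

# Toy example with random unit vectors standing in for encoder output
q = torch.nn.functional.normalize(torch.randn(4, 128), dim=1)
d = torch.nn.functional.normalize(torch.randn(20, 128), dim=1)
scores, matches = best_token_matches(q, d)
# matches[i] is the passage-token index that best matches query token i
```

`matches` then lets you say "query token i best matches passage token `matches[i]`" for any of the top-100 passages you re-encode.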
Otherwise you can also keep track of the tokens during indexing and then do this more directly, but it requires more changes to the code.
Let me know if I can help here. Closing for now due to inactivity on this post.
Hi, I am trying to shed some light on what's going on under the hood in ColBERT in some specific cases. For this reason, I would like to be able to say "the closest match of `query3_token1` is `document362_token5`". Would something like that be possible?
From what I've seen so far getting this sort of insight is not so trivial (unless I'm missing something).
`faiss_index.retrieve` would return a list of approximate nearest-neighbor passage ids. Even if I crop and use only the query token embedding I'm interested in, I would still get a bunch of `passage_id`s back. Then I would still need to figure out: (a) which document is the "top" document (since my understanding is that those results are approximate), and (b) which document token is the one matching with `maxSim`.
Regarding (a), I guess I could use the 'rerank' stage of ColBERT, which would calculate "exact" scores for all documents. Regarding (b), I see there's an `emb2pid` mapping created during indexing, but my guess is that it contains no information about the (BERT) `token_id` it represents. So I would have to take the embedding id, trace back which document it belongs to, and subtract the document's start offset to find the token position in the document. Then I would have to tokenize the document, either using the ColBERT tokenizer, or the BERT tokenizer while incrementing the position by +1 to account for the missing `[Q]` token.

Is there an easier/more straightforward way to do this, either from within ColBERT or faiss itself?