stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

about the ranking operation #281

Closed. Hannibal046 closed this 6 months ago

Hannibal046 commented 6 months ago

Hi @okhat, thanks for the great repo! I have a hard time understanding the reranking code, especially this section: https://github.com/stanford-futuredata/ColBERT/blob/706a7265b06c6b8de1e3236294394e5ada92134e/colbert/ranking/index_ranker.py#L56C7-L112

I have searched the relevant GitHub issues and understand from this issue that the code is written this way for efficiency.

But is there documentation somewhere that would help me understand it better? It seems the code first turns the 2D embeddings into 3D embeddings with different strides (108 and 180) for matrix multiplication. But I don't get why we need this stride parameter. Why couldn't we just do something like the following (a rough sketch follows the list)?

  1. load all embeddings and the corresponding doclens
  2. gather the embeddings of each passage based on the pids
  3. pad them to the same length for matrix multiplication
  4. apply the MaxSim operation and select the top-k
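For concreteness, here is a minimal PyTorch sketch of that naive approach. The names (`naive_rerank`, `all_embeddings`, `doclens`, `pids`) are hypothetical, not from the repo; this is just what steps 1-4 above would look like, not how ColBERT actually implements reranking:

```python
import torch

def naive_rerank(Q, all_embeddings, doclens, pids, k=10):
    # Q: (num_query_tokens, dim) query embeddings
    # all_embeddings: (total_doc_tokens, dim) flat 2D document token embeddings
    # doclens: per-passage token counts; pids: candidate passage ids
    offsets = torch.cumsum(torch.tensor([0] + list(doclens)), dim=0)

    # Steps 1-2: gather the embeddings of each candidate passage.
    docs = [all_embeddings[offsets[pid]:offsets[pid + 1]] for pid in pids]

    # Step 3: pad every passage to the same length -> (num_pids, max_len, dim).
    D = torch.nn.utils.rnn.pad_sequence(docs, batch_first=True)
    lengths = torch.tensor([d.size(0) for d in docs])
    mask = torch.arange(D.size(1))[None, :] < lengths[:, None]   # real vs. padded tokens

    # Step 4: MaxSim (max over doc tokens, sum over query tokens), then top-k.
    scores = (D @ Q.T).masked_fill(~mask.unsqueeze(-1), -1e4)
    scores = scores.max(dim=1).values.sum(dim=-1)                # (num_pids,)
    topk_scores, topk_idx = scores.topk(min(k, len(pids)))
    return [pids[i] for i in topk_idx.tolist()], topk_scores
```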
Hannibal046 commented 6 months ago

After digging into the code, I think I got the logic behind it. The stride is meant to partition the documents into buckets based on their length (if I understand correctly). In the actual implementation of ColBERTv1, the documents are split into two buckets: one for documents whose length is at most the 90th-percentile length of the collection (108 tokens here), and the other (with stride 180) for the rest.
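As a rough illustration of where a value like 108 could come from (the variable names and the toy lengths below are made up, not the repo's):

```python
import torch

# Illustrative per-passage token counts (a stand-in for the collection's real doclens).
doclens = torch.tensor([60.0, 70.0, 88.0, 95.0, 102.0, 108.0, 130.0, 180.0])

stride_short = int(torch.quantile(doclens, 0.9).ceil())  # ~90th-percentile length (108 in the real index)
stride_long = int(doclens.max())                         # maximum length (180 in the real index)

# Passages no longer than stride_short land in the "short" bucket and are padded
# to that width; the remaining ~10% land in the "long" bucket, padded to stride_long.
```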

Partitioning the documents this way saves FLOPs in the matching step: `scores = (D @ group_Q) * mask.unsqueeze(-1)`
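Put together, the per-bucket scoring is roughly the following (a sketch, assuming `D` is the padded `(num_docs, stride, dim)` tensor for one bucket, `group_Q` the `(dim, num_query_tokens)` query matrix, and `mask` marks real vs. padded positions; not the exact code in `index_ranker.py`):

```python
import torch

def bucket_maxsim(D, group_Q, mask):
    # D: (num_docs, stride, dim), group_Q: (dim, num_q), mask: (num_docs, stride)
    scores = (D @ group_Q) * mask.unsqueeze(-1)   # zero out the padded token positions
    # MaxSim: best document token per query token, summed over the query tokens.
    return scores.max(dim=1).values.sum(dim=-1)   # (num_docs,)
```

Since the short bucket is padded only to 108 instead of 180, roughly 90% of the documents go through a noticeably smaller matrix multiplication.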

BTW, for anyone who is also curious about the stride operation, I hope these help: [attached diagrams illustrating the stride/bucketing operation]