Building ivf for large datasets - Githubissues

stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)

MIT License

Building ivf for large datasets #327

Open jenhsia opened 3 months ago

jenhsia commented 3 months ago

When running _build_ivf(self) on a large corpus, it often gets stuck at the codes = codes.sort() step. To avoid sorting a massive list, we can:

1) Create an ivf_dict that maps each partition index to the list of embedding indices that belong to that partition.
2) Use ivf_dict to build the following without sorting:

a list of embedding indices ordered by partition (ivf), obtained by concatenating the dictionary values in partition-index order, and
a list of the number of embeddings belonging to each partition (ivf_lengths).
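
The two steps above can be sketched roughly as follows. This is a minimal pure-Python illustration of the idea, not ColBERT's actual code: the function name build_ivf_without_sort and the num_partitions argument are hypothetical, and codes is assumed to be a plain sequence where codes[i] is the partition index of embedding i.

```python
from collections import defaultdict


def build_ivf_without_sort(codes, num_partitions):
    """Hypothetical helper: group embedding indices by partition without a global sort.

    codes[i] is assumed to be the partition index assigned to embedding i.
    """
    # Step 1: map each partition index to the embedding indices it contains.
    ivf_dict = defaultdict(list)
    for embedding_idx, partition_idx in enumerate(codes):
        ivf_dict[partition_idx].append(embedding_idx)

    # Step 2: walk the partitions in index order and concatenate their lists.
    ivf = []
    ivf_lengths = []
    for partition_idx in range(num_partitions):
        members = ivf_dict.get(partition_idx, [])  # empty partitions get length 0
        ivf.extend(members)
        ivf_lengths.append(len(members))

    return ivf, ivf_lengths


# Toy example: 6 embeddings assigned to 4 partitions (partition 3 is empty).
codes = [2, 0, 1, 0, 2, 2]
ivf, ivf_lengths = build_ivf_without_sort(codes, num_partitions=4)
print(ivf)          # [1, 3, 2, 0, 4, 5]
print(ivf_lengths)  # [2, 1, 3, 0]
```

Because embeddings are appended in increasing index order, the indices inside each partition stay in ascending order. A Python-level loop like this only shows the logic; on a real corpus the grouping would probably need to run in chunks or with vectorized operations to be practical.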