On a large corpus, _build_ivf(self) often gets stuck at the codes = codes.sort() step.
To avoid sorting a massive list, we can:
1) Create an ivf_dict that maps each partition index to the list of embedding indices belonging to that partition.
2) Use ivf_dict to build the following without sorting:
a list of embedding indices ordered by partition (ivf), obtained by concatenating the dictionary values in order of partition index, and
a list of the number of embeddings belonging to each partition (ivf_lengths).
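
The two steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the actual patch: the function name build_ivf_from_dict, the num_partitions argument, and the use of Python lists instead of tensors are assumptions made for the example.

```python
from collections import defaultdict


def build_ivf_from_dict(codes, num_partitions):
    """Hypothetical helper: codes[i] is the partition index of embedding i."""
    # Step 1: map each partition index to the embedding indices assigned to it.
    ivf_dict = defaultdict(list)
    for embedding_idx, partition_idx in enumerate(codes):
        ivf_dict[partition_idx].append(embedding_idx)

    # Step 2: visit partitions in index order and concatenate their members.
    # No global sort is needed; empty partitions contribute a length of 0.
    ivf = []
    ivf_lengths = []
    for partition_idx in range(num_partitions):
        members = ivf_dict.get(partition_idx, [])
        ivf.extend(members)
        ivf_lengths.append(len(members))
    return ivf, ivf_lengths


# Toy input: five embeddings assigned to four possible partitions.
codes = [2, 0, 2, 1, 0]
ivf, ivf_lengths = build_ivf_from_dict(codes, num_partitions=4)
print(ivf)          # [1, 4, 3, 0, 2]
print(ivf_lengths)  # [2, 1, 2, 0]
```

On this toy input the result matches the order a stable sort of the codes would give, while only a single pass over the codes and a single pass over the partitions is made.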