spotify / voyager

🛰️ An approximate nearest-neighbor search library for Python and Java with a focus on ease of use, simplicity, and deployability.
https://spotify.github.io/voyager/
Apache License 2.0
1.26k stars 51 forks source link

Implement order preserving transform for inner product #24

Closed dylanrb123 closed 10 months ago

dylanrb123 commented 10 months ago

As described in this paper this is a mechanism to reduce the maximum inner-product search problem into a nearest-neighbors search problem by adding an extra dimension to each vector which preserves the triangle inequality.

Per the paper:

The triangle inequality does not hold between vectors x, yi, and yj when an inner product compares them, as is the case in MIP. Many efficient search data structures rely on the triangle inequality, and if MIP can be transformed to NN with its Euclidian distance, these data structures would immediately become applicable. Our first theorem states that MIP can be reduced to NN by having an Euclidian metric in one more dimension than the original problem.

This extra dimension is equal to sqrt(max_norm**2 - norm(x)**2) for each vector x in the dataset, where max_norm is the maximum norm of all vectors in the dataset.

One thing to note on the implementation in this library, since Voyager is not exclusively meant to be used in batch and allows adding new items after the index was initially built, there is no way of knowing what the maximum norm for the dataset will be since the dataset is unknown at build time. As such, we simply calculate the extra dimension based on the data that we have seen so far. This means that if you add a new vector with a larger norm than anything seen so far, the accuracy of the index will suffer. This is similar to the approach taken by Vespa, see their blog post on the matter here. If you have a priori knowledge of your dataset it is recommended that you insert the item with the largest norm first.

Still TODO: document this feature in the python and java docs