opendistro-for-elasticsearch / k-NN

🆕 A machine learning plugin which supports an approximate k-NN search algorithm for Open Distro.
https://opendistro.github.io/
Apache License 2.0

Add nmslib's bit_hamming spaces into plugin #283

Open luyuncheng opened 3 years ago

luyuncheng commented 3 years ago

I saw that #264 added Hamming distance to custom scoring, which is a great feature. nmslib also has a bit_hamming space (space_bit_hamming), and I think we could add it to the plugin.

I looked at the space_bit_hamming and space_bit_hamming_test code. Maybe we could add "SpaceBitVector" to the plugin and support the bit_hamming space, which uses a non-optimized index.
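For readers unfamiliar with the metric, here is a minimal Python sketch of bit Hamming distance: the number of bit positions in which two packed bit vectors differ. It is only an illustration of the idea, not nmslib's implementation, and the helper names `pack_bits` and `bit_hamming` are made up for this example.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into a single Python int (first bit is most significant)."""
    value = 0
    for bit in bits:
        value = (value << 1) | (bit & 1)
    return value


def bit_hamming(a, b):
    """Count the bit positions where the two packed vectors differ."""
    # XOR leaves a 1 exactly where the inputs disagree; then count the 1s.
    return bin(a ^ b).count("1")


x = pack_bits([1, 0, 1, 1, 0, 0, 1, 0])
y = pack_bits([1, 1, 1, 0, 0, 0, 1, 1])
distance = bit_hamming(x, y)
print(distance)
```

Packing bits into machine words and using XOR plus a population count is the usual way to keep this distance cheap, which is why a dedicated bit-vector space is attractive.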

I also looked at PR #161, which adds a non-optimized index for "negdotprod", and at nmslib's Python binding code (python_binding_nmslib). Maybe we could add a "save_data" option to the plugin so that it stores both the index and the dataset for non-optimized indices.

So I have submitted a PR for this.

vamshin commented 3 years ago

Hi @luyuncheng,

We have a few concerns about the non-optimized index for Hamming distance in nmslib. In Elasticsearch we store one serialized graph per segment, which means one additional file per knn_vector field. For non-optimized indices like Hamming we would end up with 2 files per segment: one for the graph and one for the data (the elements in the graph). With a large data set, it is possible to end up with a large number of segments, which could exhaust the available file descriptors. PR #161, which you mentioned, is on hold for the same reason. We worked with the nmslib team to make an optimized index for negative dot product so that there is one file per segment. We will have a new PR that enables negative dot product with the optimized index.
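To make the file-count concern concrete, here is a rough back-of-the-envelope sketch in Python. It assumes every segment holds every knn_vector field and that each file costs one open descriptor; the segment and field counts are made-up numbers, and `knn_index_files` is a hypothetical helper, not part of the plugin.

```python
def knn_index_files(segments, knn_fields, files_per_field):
    """Estimate k-NN index files, assuming every segment contains every knn_vector field."""
    return segments * knn_fields * files_per_field


# Hypothetical node: 500 segments, 2 knn_vector fields per index.
optimized = knn_index_files(segments=500, knn_fields=2, files_per_field=1)      # graph only
non_optimized = knn_index_files(segments=500, knn_fields=2, files_per_field=2)  # graph + data

print(optimized, non_optimized)
```

Under these assumptions the non-optimized layout doubles the number of k-NN files, on top of the files Lucene already keeps open for each segment.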

There are a couple of suggestions:

  1. Enable optimized index support for Hamming in nmslib and then incorporate the changes into the k-NN plugin.
  2. Make use of the custom scoring feature for Hamming.

How about you start with the 2nd approach and let us know if you see any performance concerns with custom scoring for Hamming?

We could then decide whether to have an optimized or non-optimized Hamming index. Until then, we would like to keep your PR (https://github.com/opendistro-for-elasticsearch/k-NN/pull/284) for Hamming support on hold.
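For the second suggestion above, a custom scoring request might look roughly like the sketch below, built as a Python dict. It assumes the script name `knn_score`, the script language `knn`, and the space type `hammingbit`, so please verify these against the custom scoring documentation for your plugin version. The field name `my_long_field` and the helper `hamming_script_score_query` are hypothetical.

```python
import json


def hamming_script_score_query(field, query_value, size=10):
    """Build a script_score request body that scores documents by bit Hamming distance.

    Assumption: the plugin exposes custom scoring as script "knn_score" with lang "knn"
    and accepts space_type "hammingbit"; check the docs for your version.
    """
    return {
        "size": size,
        "query": {
            "script_score": {
                # Score every document; replace match_all with a filter to narrow the set.
                "query": {"match_all": {}},
                "script": {
                    "lang": "knn",
                    "source": "knn_score",
                    "params": {
                        "field": field,              # hypothetical field name
                        "query_value": query_value,  # query bits, e.g. packed into a long
                        "space_type": "hammingbit",
                    },
                },
            }
        },
    }


body = hamming_script_score_query("my_long_field", 23)
print(json.dumps(body, indent=2))
```

Because this scores documents by brute force, no graph or data file is written per segment, which avoids the file descriptor concern at the cost of query latency on large indices.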