opendistro-for-elasticsearch / k-NN

🆕 A machine learning plugin which supports an approximate k-NN search algorithm for Open Distro.
https://opendistro.github.io/
Apache License 2.0

Add binary Hamming distance to custom scoring #264

Closed jmazanec15 closed 3 years ago

jmazanec15 commented 3 years ago

The Hamming distance is the number of positions at which two equal-length sequences differ. This feature adds support for Hamming distance over binary sequences within the custom scoring logic.
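As an illustration of the metric itself (not the plugin's implementation), here is a minimal Python sketch that counts the differing bits between two equal-length byte sequences. The function name `hamming_distance` is hypothetical and chosen only for this example.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte sequences differ.

    Hypothetical helper for illustration; not part of the k-NN plugin.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    # XOR leaves a 1 bit wherever the two inputs differ; count those bits.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


if __name__ == "__main__":
    # 0b1011 and 0b1001 differ in exactly one bit position.
    print(hamming_distance(bytes([0b1011]), bytes([0b1001])))
```

The length check matters because the Hamming distance is only defined for sequences of equal length.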

A user should be able to ingest documents containing a field that can be interpreted as binary data, and then apply the k-NN custom scoring script to get the k most similar results under this distance metric. To implement this, we need to determine which Elasticsearch types we will use to represent binary data. For instance, Elasticsearch has the binary field type, which stores Base64-encoded values. We will also look into supporting integral types (long, int, short, byte) and arrays of them. These may offer performance improvements over Base64.
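To make the two candidate representations concrete, the following Python sketch computes the same distance from Base64-encoded strings and from fixed-width integers. It is only an assumption about how such values could be compared, not the plugin's scoring code; `hamming_base64`, `hamming_integral`, and the `bits` parameter are hypothetical names.

```python
import base64


def hamming_base64(a_b64: str, b_b64: str) -> int:
    """Hamming distance between two Base64-encoded binary values (illustrative)."""
    a = base64.b64decode(a_b64)
    b = base64.b64decode(b_b64)
    if len(a) != len(b):
        raise ValueError("decoded values must have equal length")
    # Treat each byte string as one big integer, XOR, and count the set bits.
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return bin(diff).count("1")


def hamming_integral(a: int, b: int, bits: int = 64) -> int:
    """Hamming distance between two integers viewed as `bits`-wide words.

    Masking to `bits` makes negative values behave like fixed-width
    two's-complement numbers (for example, a 64-bit long).
    """
    mask = (1 << bits) - 1
    return bin((a ^ b) & mask).count("1")


if __name__ == "__main__":
    x = base64.b64encode(b"\xff\x00").decode("ascii")
    y = base64.b64encode(b"\x0f\x00").decode("ascii")
    print(hamming_base64(x, y))
    print(hamming_integral(5, 3))
```

The integral path skips the decode step entirely, which is one plausible reason it could be faster than Base64, though that would need to be measured.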

A possible follow-up enhancement is fixed-radius nearest neighbor search, which returns all documents that fall within a given radius of the query data.
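The difference between top-k and fixed-radius search can be sketched with a brute-force scan in Python. This is a toy example under the assumption that documents are held in an in-memory dictionary; `top_k`, `within_radius`, and the document layout are hypothetical and say nothing about how the plugin would execute either query.

```python
import heapq
from typing import Dict, List, Tuple


def _hamming(a: bytes, b: bytes) -> int:
    """Bitwise Hamming distance between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def top_k(query: bytes, docs: Dict[str, bytes], k: int) -> List[Tuple[int, str]]:
    """Return the k closest documents as (distance, doc_id), nearest first."""
    scored = ((_hamming(query, value), doc_id) for doc_id, value in docs.items())
    return heapq.nsmallest(k, scored)


def within_radius(query: bytes, docs: Dict[str, bytes], radius: int) -> List[Tuple[int, str]]:
    """Return every document whose distance to the query is at most `radius`."""
    scored = ((_hamming(query, value), doc_id) for doc_id, value in docs.items())
    return sorted(pair for pair in scored if pair[0] <= radius)


if __name__ == "__main__":
    docs = {"a": b"\x00", "b": b"\x01", "c": b"\xff"}
    print(top_k(b"\x00", docs, k=2))
    print(within_radius(b"\x00", docs, radius=1))
```

Top-k always returns a fixed number of results, while the radius query returns a variable number, possibly none, depending on how many documents fall inside the threshold.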

luyuncheng commented 3 years ago

I see there is a space_bit_hamming in nmslib. Would it be possible to use this space in k-NN?

jmazanec15 commented 3 years ago

Hi @luyuncheng, I think we could investigate adding that functionality. Because this issue is dedicated to adding Hamming distance to custom scoring, I am going to close it. Could you please open a separate issue for integrating nmslib's bit Hamming space into the plugin?