opendistro-for-elasticsearch / k-NN

🆕 A machine learning plugin which supports an approximate k-NN search algorithm for Open Distro.
https://opendistro.github.io/
Apache License 2.0
277 stars 55 forks source link

[On Hold] Add bit hamming #284

Open luyuncheng opened 3 years ago

luyuncheng commented 3 years ago

*Issue #283,

Description of changes:

As I see #264 that add Hamming distance in custom scoring it is a great functionality. i see there is bit_hamming space space_bit_hamming in nmslib. i think we can add this into plugin.

i refer to the code space_bit_hamming and space_bit_hamming_test, may be we could add "SpaceBitVector" into plugin and support bit_hamming space which is no optimized index.

i also refer to PR: #161 which add no optimized index for "negdotprod", i see the nmslib's python_binding code python_binding_nmslib, may be we could add a "save_data" into plugin and can store index and dataset for "no optimized index".

so, i submit this PR.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

codecov[bot] commented 3 years ago

Codecov Report

Merging #284 (c2abcb2) into master (8c802d9) will increase coverage by 1.62%. The diff coverage is 85.07%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #284      +/-   ##
============================================
+ Coverage     79.27%   80.89%   +1.62%     
- Complexity      359      413      +54     
============================================
  Files            58       62       +4     
  Lines          1404     1581     +177     
  Branches        126      147      +21     
============================================
+ Hits           1113     1279     +166     
- Misses          243      249       +6     
- Partials         48       53       +5     
Impacted Files Coverage Δ Complexity Δ
...istroforelasticsearch/knn/index/v206/KNNIndex.java 74.71% <75.00%> (-2.04%) 18.00 <7.00> (+10.00) :arrow_down:
...index/codec/KNN80Codec/KNN80DocValuesConsumer.java 75.00% <83.33%> (+4.11%) 18.00 <0.00> (+3.00)
...relasticsearch/knn/index/KNNVectorFieldMapper.java 79.22% <86.66%> (+0.27%) 17.00 <0.00> (+1.00)
...nn/index/codec/KNN80Codec/KNN80CompoundFormat.java 96.15% <90.00%> (-3.85%) 4.00 <1.00> (+1.00) :arrow_down:
...forelasticsearch/knn/index/codec/KNNCodecUtil.java 87.50% <90.90%> (+4.16%) 5.00 <3.00> (+3.00)
...opendistroforelasticsearch/knn/index/KNNQuery.java 81.81% <100.00%> (+9.59%) 10.00 <2.00> (+3.00)
...pendistroforelasticsearch/knn/index/KNNWeight.java 97.67% <100.00%> (+0.53%) 13.00 <3.00> (+3.00)
...endistroforelasticsearch/knn/index/SpaceTypes.java 100.00% <100.00%> (ø) 7.00 <2.00> (+2.00)
...earch/knn/plugin/script/KNNWhitelistExtension.java 100.00% <0.00%> (ø) 3.00% <0.00%> (?%)
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 8c802d9...62455ed. Read the comment docs.

jmazanec15 commented 3 years ago

Hi @luyuncheng,

Thanks for the PR! I will take a look some time early next week. My concern about the "non-optimized" indices is that we will end up storing two different files for the HNSW structure in the Lucene index. I need to think about the implications of doing this some more.

luyuncheng commented 3 years ago

Hi @luyuncheng,

Thanks for the PR! I will take a look some time early next week. My concern about the "non-optimized" indices is that we will end up storing two different files for the HNSW structure in the Lucene index. I need to think about the implications of doing this some more.

161 in this PR, it Write footer for .dat file. I reuse this as "non-optimized" data storage.

As i checked the nmslib's code, When we add "non-optimized" data storage, K-NN plugin can support all nmslib's Space Types

jmazanec15 commented 3 years ago

@luyuncheng Apologies, have not gotten a chance to look at this yet. Will review in January.

luyuncheng commented 3 years ago

Previously, this PR using string to parse the bit vector, this may occupy too much memory! As i checked the code: space_bit_hamming.h in nmslib, i think use a int32 array to represent a bit vector is better. so, add modified some code to use vector as bit vector

vamshin commented 3 years ago

This Pr is put on hold. Please refer https://github.com/opendistro-for-elasticsearch/k-NN/issues/283#issuecomment-755598088