opendistro-for-elasticsearch / k-NN

🆕 A machine learning plugin which supports an approximate k-NN search algorithm for Open Distro.
https://opendistro.github.io/
Apache License 2.0
277 stars 55 forks source link

Support for hamming bit distance in custom scoring #267

Closed jmazanec15 closed 3 years ago

jmazanec15 commented 3 years ago

Issue #, if available:

264

Description of changes: This PR extends custom scoring to support hamming distance on binary data. The PR allows hamming distance to be applied to Long and Binary Elasticsearch fields. With the Long fieldType, a user will be able to encode up to 64 bits of data as a signed long using Elasticsearch's Long fieldtype. Here is an example query:

POST /myindex/_search
{
 "size": 4,
 "query": {
   "script_score": {
     "query": {
       "match_all": {}
     },
     "script": {
       "source": "knn_score",
       "lang": "knn",
       "params": {
         "field": "my_long",
         "query_value": 170,
         "space_type": "hammingbit"
       }
     }
   }
 }
}

I experimented with using long arrays to extend the number of bits a user can encode past 64 bits. However, because Elasticsearch doc values are not guaranteed to be in the order they were ingested in, we would have to use the "_source" field to accomplish this, which would add performance penalties.

In addition to supporting the Long type, a user can also run hamming distance k-NN search with Elasticsearch's binary field type on arbitrarily sized data. The binary field type uses Base64 encoding scheme. An example query would look like this:

POST /myindex2/_search
{
 "size": 2,
 "query": {
   "script_score": {
     "query": {
       "match_all": {}
     },
     "script": {
       "source": "knn_score",
       "lang": "knn",
       "params": {
         "field": "my_binary",
         "query_value": "q83vQUI=",
         "space_type": "hammingbit"
       }
     }
   }
 }
}

A third possibility would be for a user to provide a hex encoded string as the binary representation of their data. However, because it would be difficult to validate this format during indexing, it could prove difficult for usage. For that reason, I did not include it in this PR. However, if there is community interest in that, I could look into it further.

I changed several other components of the custom scoring implementation in order to support types other than KNNVectors. Additionally, I moved as much of the query validation/parsing out of the execute portion of the scripts as I could to improve performance.

Lastly, I changed the parameter for passing the query value from "vector" to "query_value" in order to generalize parsing. I was hoping to get some thoughts/opinions on this.

If a user wants to perform an r-NN search, where results are only returned if they meet a certain score, they can use Elasticsearch's min_score parameter.

For testing, I created a single node cluster based off of ODFE 1.11, and installed my changes onto it. I ran the testing script attached to the PR a few times to validate recall and performance (hamming_test.txt). Additionally, I added several unit and integration tests.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

codecov[bot] commented 3 years ago

Codecov Report

Merging #267 (ecdfe06) into master (33018ad) will increase coverage by 0.95%. The diff coverage is 89.47%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #267      +/-   ##
============================================
+ Coverage     78.72%   79.68%   +0.95%     
- Complexity      338      355      +17     
============================================
  Files            53       57       +4     
  Lines          1335     1378      +43     
  Branches        121      125       +4     
============================================
+ Hits           1051     1098      +47     
+ Misses          235      234       -1     
+ Partials         49       46       -3     
Impacted Files Coverage Δ Complexity Δ
...oforelasticsearch/knn/index/util/KNNConstants.java 0.00% <ø> (ø) 0.00 <0.00> (ø)
...arch/knn/plugin/script/KNNScoringScriptEngine.java 66.66% <50.00%> (ø) 4.00 <0.00> (ø)
...lasticsearch/knn/plugin/script/KNNScoreScript.java 74.41% <74.41%> (ø) 1.00 <1.00> (?)
...arch/knn/plugin/script/KNNScoringSpaceFactory.java 88.88% <88.88%> (ø) 4.00 <4.00> (?)
...csearch/knn/plugin/script/KNNScoringSpaceUtil.java 96.55% <96.55%> (ø) 17.00 <17.00> (?)
...relasticsearch/knn/index/KNNVectorFieldMapper.java 78.78% <100.00%> (+0.16%) 15.00 <0.00> (ø)
...earch/knn/plugin/script/KNNScoreScriptFactory.java 100.00% <100.00%> (ø) 5.00 <5.00> (?)
...asticsearch/knn/plugin/script/KNNScoringSpace.java 100.00% <100.00%> (ø) 0.00 <0.00> (?)
...lasticsearch/knn/plugin/script/KNNScoringUtil.java 96.77% <100.00%> (-0.85%) 11.00 <2.00> (-4.00)
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 33018ad...ecdfe06. Read the comment docs.

VijayanB commented 3 years ago

Can you update the description to use "space_type": "hammingbit"?

jmazanec15 commented 3 years ago

Can you update the description to use "space_type": "hammingbit"?

Done