Closed jmazanec15 closed 3 years ago
Merging #267 (ecdfe06) into master (33018ad) will increase coverage by
0.95%
. The diff coverage is89.47%
.
@@ Coverage Diff @@
## master #267 +/- ##
============================================
+ Coverage 78.72% 79.68% +0.95%
- Complexity 338 355 +17
============================================
Files 53 57 +4
Lines 1335 1378 +43
Branches 121 125 +4
============================================
+ Hits 1051 1098 +47
+ Misses 235 234 -1
+ Partials 49 46 -3
Impacted Files | Coverage Δ | Complexity Δ | |
---|---|---|---|
...oforelasticsearch/knn/index/util/KNNConstants.java | 0.00% <ø> (ø) |
0.00 <0.00> (ø) |
|
...arch/knn/plugin/script/KNNScoringScriptEngine.java | 66.66% <50.00%> (ø) |
4.00 <0.00> (ø) |
|
...lasticsearch/knn/plugin/script/KNNScoreScript.java | 74.41% <74.41%> (ø) |
1.00 <1.00> (?) |
|
...arch/knn/plugin/script/KNNScoringSpaceFactory.java | 88.88% <88.88%> (ø) |
4.00 <4.00> (?) |
|
...csearch/knn/plugin/script/KNNScoringSpaceUtil.java | 96.55% <96.55%> (ø) |
17.00 <17.00> (?) |
|
...relasticsearch/knn/index/KNNVectorFieldMapper.java | 78.78% <100.00%> (+0.16%) |
15.00 <0.00> (ø) |
|
...earch/knn/plugin/script/KNNScoreScriptFactory.java | 100.00% <100.00%> (ø) |
5.00 <5.00> (?) |
|
...asticsearch/knn/plugin/script/KNNScoringSpace.java | 100.00% <100.00%> (ø) |
0.00 <0.00> (?) |
|
...lasticsearch/knn/plugin/script/KNNScoringUtil.java | 96.77% <100.00%> (-0.85%) |
11.00 <2.00> (-4.00) |
|
... and 4 more |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 33018ad...ecdfe06. Read the comment docs.
Can you update the description to use "space_type": "hammingbit"?
Can you update the description to use "space_type": "hammingbit"?
Done
Issue #, if available:
264
Description of changes: This PR extends custom scoring to support hamming distance on binary data. The PR allows hamming distance to be applied to Long and Binary Elasticsearch fields. With the Long fieldType, a user will be able to encode up to 64 bits of data as a signed long using Elasticsearch's Long fieldtype. Here is an example query:
I experimented with using long arrays to extend the number of bits a user can encode past 64 bits. However, because Elasticsearch doc values are not guaranteed to be in the order they were ingested in, we would have to use the "_source" field to accomplish this, which would add performance penalties.
In addition to supporting the Long type, a user can also run hamming distance k-NN search with Elasticsearch's binary field type on arbitrarily sized data. The binary field type uses Base64 encoding scheme. An example query would look like this:
A third possibility would be for a user to provide a hex encoded string as the binary representation of their data. However, because it would be difficult to validate this format during indexing, it could prove difficult for usage. For that reason, I did not include it in this PR. However, if there is community interest in that, I could look into it further.
I changed several other components of the custom scoring implementation in order to support types other than KNNVectors. Additionally, I moved as much of the query validation/parsing out of the execute portion of the scripts as I could to improve performance.
Lastly, I changed the parameter for passing the query value from "vector" to "query_value" in order to generalize parsing. I was hoping to get some thoughts/opinions on this.
If a user wants to perform an r-NN search, where results are only returned if they meet a certain score, they can use Elasticsearch's min_score parameter.
For testing, I created a single node cluster based off of ODFE 1.11, and installed my changes onto it. I ran the testing script attached to the PR a few times to validate recall and performance (hamming_test.txt). Additionally, I added several unit and integration tests.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.