heemin32 opened 2 weeks ago
Overall, the interface looks good. A few comments:
Might be good to reference #81.
"dimension": 24, // This should be multiple of 8
In future, can we just ignore extra bits?
"space_type": "hammingdistance", // only support hammingdistance
No, I think hamming is good here. We used hammingbit for script scoring, but the bit portion is redundant. (ref: https://opensearch.org/docs/latest/search-plugins/knn/knn-score-script/)
Will there be a lower level design coming up?
Might be good to reference https://github.com/opensearch-project/k-NN/issues/81.
Added a reference to https://github.com/opensearch-project/k-NN/issues/1764, which links to https://github.com/opensearch-project/k-NN/issues/81.
In future, can we just ignore extra bits?
There is not much difference in user experience even if we ignore extra bits, because packing into bytes is done on the user side. If we support an input format of an array of binary values (e.g., 0, 1, 1, 0) in the future, we will pad with zeros to make the length a multiple of 8.
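To make the padding idea concrete, here is a minimal sketch (not the plugin's code) of the behavior described above, assuming trailing zeros are used to extend the bit array to a multiple of 8:

```python
# Hypothetical helper illustrating the padding described in the comment above:
# extend a bit array with trailing zeros so its length is a multiple of 8.
def pad_bits(bits):
    remainder = len(bits) % 8
    if remainder:
        bits = bits + [0] * (8 - remainder)
    return bits

print(pad_bits([0, 1, 1, 0]))  # -> [0, 1, 1, 0, 0, 0, 0, 0]
```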
No, I think hamming is good here.
Got it. Updated the RFC.
Overview
Customer demand for binary format support is becoming evident, with numerous cases demonstrating strong recall when using binary values generated from large language models (LLMs). For example, Cohere's Embed model, which natively supports binary embeddings, has shown that binary vectors can retain 90-98% of the original search quality.
Given the impressive recall achieved with binary vectors, a growing number of users want to leverage binary vectors in OpenSearch k-NN indices to significantly reduce memory costs. Moving from float32 vectors to binary vectors reduces the memory requirement by a factor of 32.
Implementing support for binary vectors in OpenSearch KNN indices is thus a highly beneficial feature, addressing customer demand and significantly lowering operational costs. This capability not only ensures high recall performance but also makes large-scale deployment more economically viable, facilitating greater adoption and efficiency.
Scope
Out of scope
Future extension
Data flow diagram
API
Input format
Users should pack their binary values into bytes (int8). For example, the binary value 0, 1, 1, 0, 0, 0, 1, 1 becomes 99.
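The packing described above can be sketched as follows. This is an illustrative client-side helper, not part of the plugin; it packs bits most-significant-bit first into signed int8 values, matching the example:

```python
# Illustrative sketch: pack a list of 0/1 bits into signed int8 bytes (MSB first).
def pack_bits(bits):
    assert len(bits) % 8 == 0, "dimension must be a multiple of 8"
    out = []
    for i in range(0, len(bits), 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        # Map unsigned 0..255 onto the signed int8 range -128..127.
        out.append(value - 256 if value > 127 else value)
    return out

print(pack_bits([0, 1, 1, 0, 0, 0, 1, 1]))  # -> [99]
```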
Index setting
Because we are using the int8 format as input, the dimension must be a multiple of 8. We are going to support a new data_type, binary. With the binary data type, hamming is the only space type that we support as of now. If the space type is not specified, hamming will be the default for the binary data type.
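A hypothetical mapping sketch for the setting described above; the index name, field name, and method parameters are illustrative, not final:

```json
PUT /my-binary-index
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 24,
        "data_type": "binary",
        "method": {
          "name": "hnsw",
          "space_type": "hamming"
        }
      }
    }
  }
}
```

Here dimension is 24 bits, so each document supplies 3 packed int8 values for this field.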
Ingestion
8 bits 0, 0, 0, 0, 1, 0, 1, 0 → 1 byte 10
8 bits 1, 0, 0, 0, 1, 0, 1, 0 → 1 byte -118
8 bits 0, 1, 1, 1, 1, 0, 1, 1 → 1 byte 123
Query
The query vector has the same data format as ingestion: binary values packed into bytes (-128 to 127).
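A hypothetical query sketch against the binary field; the index name, field name, and packed values are illustrative:

```json
GET /my-binary-index/_search
{
  "query": {
    "knn": {
      "my_vector": {
        "vector": [99, 10, -118],
        "k": 10
      }
    }
  }
}
```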
Reference
Meta issue: https://github.com/opensearch-project/k-NN/issues/1764