heemin32 opened 2 weeks ago
Overall, the interface looks good. A few comments:
Might be good to reference #81.
"dimension": 24, // This should be multiple of 8
In future, can we just ignore extra bits?
"space_type": "hammingdistance", // only support hammingdistance
No, I think hamming is good here. We used hammingbit for script scoring, but the bit portion is redundant. (ref: https://opensearch.org/docs/latest/search-plugins/knn/knn-score-script/)
Will there be a lower level design coming up?
Might be good to reference https://github.com/opensearch-project/k-NN/issues/81.
Added a reference to https://github.com/opensearch-project/k-NN/issues/1764, which links to https://github.com/opensearch-project/k-NN/issues/81.
In future, can we just ignore extra bits?
There is not much difference in user experience even if we ignore extra bits, because packing into bytes is done on the user side. If we support an input format of an array of binary values (e.g., 0, 1, 1, 0) in the future, we will pad with zeros to make the length a multiple of 8.
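To make the padding idea concrete, here is a minimal sketch (not the plugin's code) of the behavior described above, assuming trailing zeros are used to extend the bit array to a multiple of 8:

```python
# Hypothetical helper illustrating the padding described in the comment above:
# extend a bit array with trailing zeros so its length is a multiple of 8.
def pad_bits(bits):
    remainder = len(bits) % 8
    if remainder:
        bits = bits + [0] * (8 - remainder)
    return bits

print(pad_bits([0, 1, 1, 0]))  # -> [0, 1, 1, 0, 0, 0, 0, 0]
```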
No, I think hamming is good here.
Got it. Updated the RFC.
Overview
Customer demand for binary format support is becoming evident, with numerous cases demonstrating strong recall when using binary values generated from large language models (LLMs). For example, Cohere's Embed model, which natively supports binary embeddings, has shown that binary vectors can retain 90-98% of the original search quality.
Given the impressive recall achieved with binary vectors, a growing number of users want to leverage binary vectors in OpenSearch k-NN indices to significantly reduce memory costs. Moving from float32 vectors to binary vectors reduces the memory requirement by a factor of 32.
Implementing support for binary vectors in OpenSearch KNN indices is thus a highly beneficial feature, addressing customer demand and significantly lowering operational costs. This capability not only ensures high recall performance but also makes large-scale deployment more economically viable, facilitating greater adoption and efficiency.
Scope
Out of scope
Future extension
Data flow diagram
API
Input format
Users should pack their binary values into bytes (int8). For example, the binary value 0, 1, 1, 0, 0, 0, 1, 1 becomes 99.
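The packing described above can be sketched as follows. This is an illustrative client-side helper, not part of the plugin; it packs bits most-significant-bit first into signed int8 values, matching the example:

```python
# Illustrative sketch: pack a list of 0/1 bits into signed int8 bytes (MSB first).
def pack_bits(bits):
    assert len(bits) % 8 == 0, "dimension must be a multiple of 8"
    out = []
    for i in range(0, len(bits), 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        # Map unsigned 0..255 onto the signed int8 range -128..127.
        out.append(value - 256 if value > 127 else value)
    return out

print(pack_bits([0, 1, 1, 0, 0, 0, 1, 1]))  # -> [99]
```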
Index setting
Because we are using the int8 format as input, the dimension must be a multiple of 8. We are going to support a new data_type, binary. With the binary data type, hamming is the only space type that we support as of now. If the space type is not specified, hamming will be the default for the binary data type.
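A hypothetical mapping sketch for the setting described above; the index name, field name, and method parameters are illustrative, not final:

```json
PUT /my-binary-index
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 24,
        "data_type": "binary",
        "method": {
          "name": "hnsw",
          "space_type": "hamming"
        }
      }
    }
  }
}
```

Here dimension is 24 bits, so each document supplies 3 packed int8 values for this field.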
Ingestion
8 bits 0, 0, 0, 0, 1, 0, 1, 0 → 1 byte 10
8 bits 1, 0, 0, 0, 1, 0, 1, 0 → 1 byte -118
8 bits 0, 1, 1, 1, 1, 0, 1, 1 → 1 byte 123
Query
The query vector has the same data format as ingestion: binary values packed into bytes (-128 to 127).
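A hypothetical query sketch against the binary field; the index name, field name, and packed values are illustrative:

```json
GET /my-binary-index/_search
{
  "query": {
    "knn": {
      "my_vector": {
        "vector": [99, 10, -118],
        "k": 10
      }
    }
  }
}
```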
Reference
Meta issue: https://github.com/opensearch-project/k-NN/issues/1764