opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

[FEATURE] Hamming distance / binary vector support #81

Closed jaredbrowarnik closed 1 month ago

jaredbrowarnik commented 3 years ago

Are there any plans to support Hamming distance / efficient binary vector storage when using HNSW-based KNN? It seems like the underlying nmslib has support for it (https://github.com/nmslib/nmslib/issues/306). This would help give parity with binary indexes in faiss: https://github.com/facebookresearch/faiss/wiki/Binary-indexes.
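For context, the Hamming distance between two binary vectors is the number of bit positions in which they differ, which is just the popcount of their XOR. A minimal sketch over packed-byte vectors (not the plugin's implementation, just the definition):

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length packed binary vectors."""
    assert len(a) == len(b)
    # XOR each byte pair, then count the set bits of the result.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Two 16-bit vectors, each packed into 2 bytes.
v1 = bytes([0b10110010, 0b01100001])
v2 = bytes([0b10010010, 0b01100011])
print(hamming_distance(v1, v2))  # 2 (one differing bit per byte)
```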

jmazanec15 commented 3 years ago

Hi @jaredbrowarnik

At the moment, we do not have a plan to add binary index support. Currently, we are working on faiss support (#27).

Adding binary vector support would be a big project. We would probably need to create a new field type.

That being said, to the community, please +1 if you would like to see this feature in the plugin.

jaredbrowarnik commented 3 years ago

Got it, thanks for the feedback @jmazanec15.

So would the added faiss support include support for their binary indexes (e.g. IndexBinaryHNSW)?

jmazanec15 commented 3 years ago

@jaredbrowarnik no, faiss support will not include it. This would be another project. I think from a high level, we might need to:

  1. create a new data type, similar to knn_vector, but for binary data
  2. enhance our existing codec/jni to support binary indices as well
  3. add another query type or enhance the existing one we have
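To illustrate step 1: a binary data type would most likely store bits packed into bytes (8 dimensions per byte), which is how faiss's binary indexes expect their input. A hypothetical pack helper sketching that layout (MSB-first ordering is an assumption here, not the plugin's spec):

```python
def pack_bits(bits: list[int]) -> bytes:
    """Pack a list of 0/1 values (length divisible by 8) into bytes, MSB first."""
    assert len(bits) % 8 == 0
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | (bit & 1)  # shift in one bit at a time
        out.append(byte)
    return bytes(out)

print(pack_bits([1, 0, 1, 1, 0, 0, 1, 0]).hex())  # 'b2'
```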
hubenjm commented 2 years ago

Would very much appreciate this support, since dense float vectors present a much bigger challenge when trying to scale to 10 billion documents.

I have also had a difficult time figuring out whether Elasticsearch 7.15 supports a bit_hamming space for binary vectors or an equivalent (e.g. a base64-encoded string), and whether the Script Score k-NN approach would even be feasible with that many documents (see https://opensearch.org/docs/latest/search-plugins/knn/knn-score-script/#getting-started-with-the-score-script-for-binary-data)
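The score-script approach is an exact, brute-force scan, so query cost grows linearly with document count, which is exactly why it becomes questionable at billions of documents. A simplified sketch of what such an exact search computes (not the plugin's code):

```python
import heapq

def exact_knn_hamming(query: bytes, corpus: list[bytes], k: int) -> list[tuple[int, int]]:
    """Exact k-NN by Hamming distance: scores every document, O(N) per query."""
    def dist(a: bytes, b: bytes) -> int:
        return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    # Keep the k (distance, doc_index) pairs with the smallest distances.
    return heapq.nsmallest(k, ((dist(query, d), i) for i, d in enumerate(corpus)))

corpus = [b"\x00\x00", b"\xff\xff", b"\x0f\x00"]
print(exact_knn_hamming(b"\x00\x00", corpus, 2))  # [(0, 0), (4, 2)]
```

At 10B documents this full scan per query is the bottleneck; an approximate binary index (e.g. binary HNSW) avoids it by visiting only a small fraction of the corpus.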

Any thoughts on the above?

hubenjm commented 2 years ago

+1

prems1891 commented 2 years ago

+1

paragor commented 2 years ago

+1

TaeWoo21 commented 1 year ago

+1

gilamsalem commented 1 year ago

+1

abbottdev commented 1 year ago

Now that faiss support has been implemented, can we use approximate Hamming distance on binary types yet? https://github.com/opensearch-project/k-NN/issues/70

Faiss has a binary index type: https://github.com/facebookresearch/faiss/wiki/Binary-indexes so I don't think it needs to be a separate project; we should just be able to use hamming as a space argument when performing a k-NN query.

jmazanec15 commented 1 year ago

Hi @abbottdev, yes I think faiss and nmslib support binary indices - we could leverage them. Could you describe your use case a little bit more - what problem space are you using this for? In #949 you had mentioned you had 500M+ 64 bit-binary vectors. How did you decide to use binary over float representations?

abbottdev commented 1 year ago

@jmazanec15 - I think my use case may be a little different from the OP's. For us, the case is that we want to use a binary vector database to index PDQ image hashes. Because this is a perceptual hashing problem, hashes that are "closer" are more relevant, and for the algorithm in question closeness is the Hamming distance between two 256-bit binary vectors. So this isn't strictly a standard ML/float vector model. The number of hashes in our instance may not be in the 500M+ range; we would likely be closer to a few hundred thousand.
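PDQ matching is typically a radius query rather than pure k-NN: two 256-bit hashes are considered a match when their Hamming distance falls under some threshold. A minimal sketch of that check (the threshold value here is illustrative only, not an official PDQ recommendation):

```python
def pdq_match(h1: bytes, h2: bytes, threshold: int = 31) -> bool:
    """Two 256-bit (32-byte) hashes 'match' if their Hamming distance <= threshold.
    The default threshold of 31 is illustrative, not an official PDQ value."""
    assert len(h1) == len(h2) == 32
    dist = sum(bin(a ^ b).count("1") for a, b in zip(h1, h2))
    return dist <= threshold

print(pdq_match(bytes(32), bytes(32)))           # identical hashes -> True
print(pdq_match(bytes(32), bytes([0xFF] * 32)))  # all 256 bits differ -> False
```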

For reference, details on the hash if you're curious: https://github.com/facebook/ThreatExchange/tree/main/pdq and page 11 of https://github.com/facebook/ThreatExchange/blob/main/hashing/hashing.pdf

hubenjm commented 1 year ago

For me, the primary value of binary vectors is that they take up less space in an index, which makes it cheaper to scale up to larger numbers of vectors, e.g. billions. That was my main concern when I asked about this years ago.
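The space savings are easy to quantify: per dimension, float32 costs 4 bytes while a packed binary vector costs 1 bit, a 32x reduction before any index overhead. A rough back-of-the-envelope sketch (document count and dimension are illustrative):

```python
def vector_bytes(dim: int, kind: str) -> int:
    """Raw storage per vector: 4 bytes/dim for float32, 1 bit/dim for packed binary."""
    return dim * 4 if kind == "float32" else dim // 8

n_docs = 1_000_000_000  # e.g. one billion documents
print(vector_bytes(256, "float32") * n_docs / 1e12)  # 1.024 (TB of raw floats)
print(vector_bytes(256, "binary") * n_docs / 1e12)   # 0.032 (TB packed binary)
```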

vamshin commented 1 year ago

@abbottdev @hubenjm how does the recall look from your experiments operating on binary indices for your use case?

gilamsalem commented 1 year ago

In our use case the binary value is a hash representing a file, and we would like to be able to search for "similar" files/hashes (lowest Hamming distance) within a repository of 1B files. We tested the exact search option, which supports Hamming distance on binary values; it worked for a small number of hashes but doesn't scale well to larger counts.

abbottdev commented 1 year ago

@vamshin (Please forgive me, I'm not from an ML background, so I don't really have any answers here.) We've not used any binary indices yet: we discarded the option of using FAISS because it didn't fit neatly into our backend stack. But it is the reference implementation that the PDQ solution I linked to above uses.

hubenjm commented 10 months ago

> @abbottdev @hubenjm how does the recall look from your experiments operating on binary indices for your use case?

I didn't run any recall experiments because I abandoned binary vectors and instead used lower-dimensional float vectors (e.g. 128-dimensional). That still takes up a lot more space than 2048 binary ints, but at least there's better support for floats. I still believe this would be a very useful feature if it ever gets prioritized.

frejonb commented 6 months ago

Binary vectors are becoming very relevant these days, see https://txt.cohere.com/int8-binary-embeddings/, https://huggingface.co/blog/embedding-quantization#binary-rescoring, https://blog.pgvecto.rs/my-binary-vector-search-is-better-than-your-fp32-vectors. It would be awesome to have this supported in OpenSearch.
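The binary-quantization schemes in those links typically threshold each float dimension at zero (sign quantization) and pack the resulting bits. A minimal sketch of the idea, independent of any particular library:

```python
def binarize(embedding: list[float]) -> bytes:
    """Quantize a float embedding to packed binary: bit = 1 where the value > 0."""
    assert len(embedding) % 8 == 0
    out = bytearray()
    for i in range(0, len(embedding), 8):
        byte = 0
        for v in embedding[i:i + 8]:
            byte = (byte << 1) | (1 if v > 0 else 0)  # sign bit, MSB first
        out.append(byte)
    return bytes(out)

print(binarize([0.3, -1.2, 0.7, 0.1, -0.5, -0.2, 2.0, -0.9]).hex())  # 'b2'
```

The rescoring step described in the Hugging Face post then re-ranks the binary candidates against the original float vectors to recover most of the lost accuracy.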

vamshin commented 6 months ago

@frejonb, added to roadmap. We will have the release tagged soon.

abbottdev commented 4 months ago

Any updates @vamshin?

vamshin commented 4 months ago

@abbottdev we are targeting this for 2.16. @shatejas is looking into it.

asfoorial commented 2 months ago

Is this going to work with the neural-search plugin? Is the query embedding going to be converted into a bit vector automatically?

heemin32 commented 2 months ago

> Is this going to work with the neural-search plugin? Is the query embedding going to be converted into a bit vector automatically?

It will work in neural-search if the model can generate the binary embedding in the correct format (packed byte). Automatic quantization to a binary vector will come in https://github.com/opensearch-project/k-NN/issues/1779
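On the "correct format (packed byte)" point: a d-bit binary embedding has to arrive as d/8 bytes, with each stored byte covering 8 dimensions. A small hypothetical validation helper illustrating the constraint (this is not the plugin's API, just the shape requirement):

```python
def validate_packed_binary(vector: bytes, dimension: int) -> None:
    """A binary vector of `dimension` bits must be supplied as dimension/8 packed bytes."""
    if dimension % 8 != 0:
        raise ValueError("binary dimension must be a multiple of 8")
    if len(vector) != dimension // 8:
        raise ValueError(f"expected {dimension // 8} bytes, got {len(vector)}")

validate_packed_binary(bytes(32), 256)  # a 256-bit embedding packs into 32 bytes
```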