opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

[FEATURE] Support for Faiss byte vector #1659

Open naveentatikonda opened 2 months ago

naveentatikonda commented 2 months ago

Is your feature request related to a problem? For the Lucene engine we have the Lucene byte vector feature, which accepts byte vectors in the range [-128, 127] and provides memory savings of up to 75% compared with fp32 vectors (for example, a 1,024-dimensional vector shrinks from 4,096 bytes to 1,024 bytes). But for large-scale workloads we usually prefer the Faiss engine, and as of today Faiss only supports fp32 and fp16 vectors (using SQfp16). Adding byte vector support to the Faiss engine therefore helps reduce memory requirements, especially for users of embedding models such as Cohere Embed that generate signed int8 embeddings in the range [-128, 127].

What solution would you like? Add a new Faiss ScalarQuantizer type, QT_8bit_direct, which does not require training and quantizes fp32 vector values (values that already lie within the signed byte range and carry no fractional precision) into byte-sized codes, reducing the memory footprint by a factor of 4. https://faiss.ai/cpp_api/struct/structfaiss_1_1ScalarQuantizer.html

https://github.com/facebookresearch/faiss/issues/3488
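
For reference, a minimal C++ sketch of how the underlying Faiss API could be driven (this is an illustrative assumption, not the plugin's actual JNI wiring; the dimension, M, and data below are made up):

```cpp
#include <faiss/IndexHNSW.h>
#include <faiss/IndexScalarQuantizer.h>
#include <cstdio>
#include <vector>

int main() {
    int d = 64;      // vector dimension (illustrative)
    int M = 16;      // HNSW graph degree (illustrative)
    size_t n = 1000; // number of vectors to index

    // HNSW graph whose storage layer is a scalar quantizer; QT_8bit_direct
    // stores each dimension as a single byte.
    faiss::IndexHNSWSQ index(d, faiss::ScalarQuantizer::QT_8bit_direct, M);

    // Synthetic data whose components are already whole numbers in the byte range.
    std::vector<float> xb(n * d);
    for (size_t i = 0; i < xb.size(); i++) {
        xb[i] = float(i % 128);
    }

    // QT_8bit_direct has a fixed range, so there is nothing to learn;
    // train() is effectively a no-op here and is kept only for completeness.
    index.train(n, xb.data());
    index.add(n, xb.data());

    // Query with the first indexed vector.
    int k = 4;
    std::vector<faiss::idx_t> labels(k);
    std::vector<float> distances(k);
    index.search(1, xb.data(), k, distances.data(), labels.data());
    printf("nearest id: %lld, distance: %f\n", (long long)labels[0], distances[0]);
    return 0;
}
```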

jmazanec15 commented 2 months ago

@naveentatikonda what quantization technique is used?

naveentatikonda commented 2 months ago

@naveentatikonda what quantization technique is used?

Scalar quantization, similar to SQfp16.

jmazanec15 commented 2 months ago

Right, but how do they implement the 8-bit quantization? I don't think they can quantize into fp8, because too much precision would be lost.

naveentatikonda commented 2 months ago

Right, but how do they implement the 8-bit quantization? I don't think they can quantize into fp8, because too much precision would be lost.

Yes, basically they serialize fp32 values into uint8 (0 to 255), which leads to a complete loss of fractional precision when the codes are deserialized back into floats. This feature helps optimize memory at the cost of recall. Also, if the vector dimension is a multiple of 16, they process 16 values in each iteration (unlike the 8 values we have seen with fp16), so I'm hoping it will boost performance and help reduce search latencies.

jmazanec15 commented 2 months ago

they serialize fp32 values into uint8 (0 to 255)

But how are they doing this? Would they take 0.2212 -> 0?

naveentatikonda commented 2 months ago

But how are they doing this? Would they take 0.2212 -> 0?

Yes, you are right. For QT_8bit_direct (which requires no training), they cast the float directly to uint8 when encoding it. That's the reason I mentioned it leads to a complete loss of fractional precision. https://github.com/facebookresearch/faiss/blob/main/faiss/impl/ScalarQuantizer.cpp#L521-L531
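
To make the 0.2212 -> 0 behavior concrete, here is a small standalone sketch that mirrors the encode/decode logic in the file linked above (simplified for illustration, not the actual Faiss source):

```cpp
#include <cstdint>
#include <cstdio>

// QT_8bit_direct-style codec: encode by casting the float straight to a byte,
// decode by promoting the byte back to a float. The fractional part is dropped.
void encode_direct(const float* x, uint8_t* code, int d) {
    for (int i = 0; i < d; i++) {
        code[i] = (uint8_t)x[i];
    }
}

void decode_direct(const uint8_t* code, float* x, int d) {
    for (int i = 0; i < d; i++) {
        x[i] = code[i];
    }
}

int main() {
    float in[4] = {0.2212f, 1.9f, 42.5f, 127.0f};
    uint8_t code[4];
    float out[4];
    encode_direct(in, code, 4);
    decode_direct(code, out, 4);
    for (int i = 0; i < 4; i++) {
        // e.g. 0.2212 -> 0 -> 0.0, 1.9 -> 1 -> 1.0
        printf("%.4f -> %u -> %.1f\n", in[i], (unsigned)code[i], out[i]);
    }
    return 0;
}
```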

For QT_8bit (non-uniform training) and QT_8bit_uniform (uniform training), they multiply and divide by a learned factor during encoding and decoding. I'll need to run some more tests on all of these quantization types with more recent datasets to see whether they preserve more precision than QT_8bit_direct. A few months back, when I ran tests with the sift and mnist datasets, QT_8bit_direct gave better recall than the other QTypes, but we shouldn't rely on those results because the data in those two datasets already lies within the uint8 range and has no fractional component.
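
For contrast, a simplified sketch of what the "multiply and divide by a factor" step looks like for a trained uniform 8-bit codec (the range constants are made up, and this is not the exact Faiss implementation):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// QT_8bit_uniform-style codec: a trained range [vmin, vmin + vdiff] is spread
// across the 256 byte codes, so some fractional precision survives the round trip.
struct Uniform8bitCodec {
    float vmin;  // learned minimum value
    float vdiff; // learned range width (vmax - vmin)

    uint8_t encode(float x) const {
        float xn = (x - vmin) / vdiff;           // normalize into [0, 1]
        xn = std::min(std::max(xn, 0.0f), 1.0f); // clamp out-of-range values
        return (uint8_t)(255.0f * xn);           // map onto 256 levels
    }

    float decode(uint8_t code) const {
        return vmin + (code + 0.5f) / 255.0f * vdiff; // mid-point reconstruction
    }
};

int main() {
    Uniform8bitCodec codec{-1.0f, 2.0f}; // pretend training found values in [-1, 1]
    float x = 0.2212f;
    uint8_t code = codec.encode(x);
    printf("%.4f -> code %u -> %.4f\n", x, (unsigned)code, codec.decode(code));
    return 0;
}
```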

jmazanec15 commented 2 months ago

From an interface perspective, I think this should then just be byte vector support for faiss. Otherwise, it may confuse users who expect it to behave like Lucene's 8-bit scalar quantization.