@naveentatikonda This is a cool proposal. With this interface, I am wondering a couple of things:
- We have received a lot of interest in supporting binary space types like hamming distance (ref: [FEATURE] Hamming distance / binary vector support #81, [FEATURE] Support for approximate search using hamming distance #949). For this use case, how would you see the interface changing to support this?
- There is some interest in supporting unsigned 16-bit floats; how would this look with this interface?
@jmazanec15 Thanks for your questions.
- We can add `binary` into the enum if we add support for Hamming Distance.
- We can use `float_16` for 16-bit floats and `float` or `float_32` for 32-bit floats. Also, in our enum we can add constants as FLOAT(4) for 32-bit floats and FLOAT(2) for 16 bits.

Makes sense. I think calling it byte may be too generic. What if we call it `int8`, like the C++ typedef? It is being treated as an 8-bit integer, not a binary value. Float can probably remain `float` (not `float32`). This is something we can change in the future, after the release, as well.
Actually, on second thought, `byte` is consistent with OpenSearch, so I am okay: https://opensearch.org/docs/latest/field-types/supported-field-types/numeric/
The purpose of this RFC (request for comments) is to gather community feedback on a proposal to add support for byte-sized vectors in the Lucene engine.
Problem Statement
As of today, the k-NN plugin only supports vectors of type float, where each dimension takes 4 bytes. This is expensive in terms of storage, especially for use cases that require large-scale ingestion, since graphs must be constructed, loaded, saved, and searched, which gets more and more costly. There are a few use cases where customers prefer to reduce the memory footprint in exchange for a minimal loss in recall.
Using the Lucene ByteVector feature, we can add support for byte-sized vectors, where each dimension of the vector is an 8-bit integer in the range [-128, 127].
How to Convert Float Values to Byte?
Quantization is the process of mapping continuous infinite values to a smaller set of discrete finite values. In the context of simulation and embedded computing, it is about approximating real-world values with a digital representation that introduces limits on the precision and range of a value.
We can make use of quantization techniques to convert float values (32 bits) to bytes (8 bits) without losing much precision. There are many quantization techniques, such as Scalar Quantization, Product Quantization (PQ, used in the faiss engine), etc.
As a P0, we are not adding support for any quantization technique, because the technique that should be used depends on the customer's use case. Based on customer requests and usage, we will add support for quantization techniques later.
Proposed Solution
Initially, since we are not planning to support any quantization technique as part of our source code, the expectation is that customers provide quantized vectors as input, of type byte integer within the range [-128, 127], for both indexing and querying. For users to ingest these vectors using the `KnnByteVectorField`, we will introduce a new optional field, `data_type`, in the index mapping. There are no changes to the query mapping.

`data_type` - Set this as `byte` to index documents as byte-sized vectors; the default value is `float`.
Examples of creating an index, ingesting documents, and querying byte-sized vectors are shown below:
Creating Index with `data_type` as `byte`
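The original example request is not reproduced here; the following is a minimal sketch of what such a mapping could look like, with the hypothetical index name `test-index`, field name `my_vector`, and dimension 3 as assumptions (the `data_type` parameter is the one proposed in this RFC):

```json
PUT test-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 3,
        "data_type": "byte",
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "lucene"
        }
      }
    }
  }
}
```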
Ingest Documents
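A sketch of an ingest request, continuing with the same assumed index and field names; each dimension value must be an integer in [-128, 127]:

```json
PUT test-index/_doc/1
{
  "my_vector": [-126, 28, 127]
}
```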
Search Query
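And a sketch of a k-NN search request against the same assumed field; note that the query vector must also be byte-quantized:

```json
GET test-index/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [26, -120, 99],
        "k": 2
      }
    }
  }
}
```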
Also, in approximate search, byte-sized vectors are supported only for the `lucene` engine. They are not supported for the `nmslib` and `faiss` engines.

Benchmarking on POC
Setup Configuration
Implemented a POC using the Lucene ByteVectorField and ran benchmarks against various datasets. The cluster configuration and index mapping are shown in the table below.
Quantization and Normalization
Min-Max Normalization - This technique performs a linear transformation on the original data, scaling the values of a feature to the range [0, 1]. This is done by subtracting the minimum value of the feature from each value and then dividing by the range of the feature.
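In symbols, each value is mapped as `x' = (x - min(x)) / (max(x) - min(x))`.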
Scalar Quantization - Splits the entire space of each dimension into discrete bins in order to reduce the overall memory footprint of the vector data.
Quantization Technique A

For these benchmarking tests, all the datasets used have float-valued vectors, so we normalized them using min-max normalization to transform and scale the values into the range [0, 1]. Finally, we quantized these values to bucketize them into 256 buckets (ranging from -128 to 127). A sketch of this technique is shown below.
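The RFC body does not reproduce code for Technique A; the following is a minimal sketch under the assumptions above (min-max normalize, then map into 256 byte buckets), using NumPy. The function name and exact rounding are illustrative, not the POC's actual implementation:

```python
import numpy as np

def quantize_technique_a(vectors: np.ndarray) -> np.ndarray:
    """Min-max normalize to [0, 1], then bucketize into the 256 byte values."""
    v_min, v_max = vectors.min(), vectors.max()
    normalized = (vectors - v_min) / (v_max - v_min)  # each value now in [0, 1]
    # Map [0, 1] onto the 256 buckets [-128, 127].
    quantized = np.floor(normalized * 256) - 128
    return np.clip(quantized, -128, 127).astype(np.int8)
```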
Quantization Technique B

Euclidean distance is shift invariant, meaning ||x-y|| = ||(x-z)-(y-z)|| (if we shift both x and y by the same z, the distance remains the same). But cosine similarity is not: cosine(x, y) does not equal cosine(x-z, y-z).

So, for the angular datasets, to avoid shifting we follow a different approach and quantize positive and negative values separately (a sketch for the glove-200-angular dataset is shown below). This technique makes a huge difference in recall, improving it from 0.17 (with QT A) to 0.77 (with QT B) for glove-200.
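The original pseudocode is not reproduced here; the following is a hedged reconstruction of the described approach (scale positives and negatives separately so that zero maps to zero, avoiding any shift), again using NumPy with illustrative names, and assuming the dataset contains both positive and negative values:

```python
import numpy as np

def quantize_technique_b(vectors: np.ndarray) -> np.ndarray:
    """Quantize positive and negative values separately so zero stays at zero,
    avoiding the shift that distorts cosine similarity."""
    v_max = vectors.max()  # largest positive value in the dataset
    v_min = vectors.min()  # most negative value in the dataset
    out = np.empty_like(vectors)
    positive = vectors >= 0
    out[positive] = vectors[positive] / v_max * 127.0          # -> [0, 127]
    out[~positive] = vectors[~positive] / abs(v_min) * 128.0   # -> [-128, 0)
    return np.round(out).astype(np.int8)
```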
Benchmarking Results Comparison
Observations
Ran a test using 1 primary shard and zero replicas. After force-merging all the segments into one segment, we can see the segment files with their corresponding sizes listed below. The storage space occupied by the segment files is the same for both float vectors and byte vectors, except for the .vec file, where the byte vectors (113 MB) occupy about 1/4 the size of the float vectors (452 MB), which is what we expect. However, we still do not see the expected overall reduction because of the .fdt files, which consume 1.7 GB for both data types; these contain nothing but the source data.
Feedback
Please provide your feedback; any questions about the feature are welcome.