opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

[RFC] Lucene Byte Sized Vector #952

Closed naveentatikonda closed 1 year ago

naveentatikonda commented 1 year ago

The purpose of this RFC (request for comments) is to gather community feedback on a new proposal of adding support for byte sized vectors in Lucene engine.

Problem Statement

As of today, the k-NN plugin only supports vectors of type float, which takes 4 bytes per dimension. This gets expensive in terms of storage, especially for use cases that require ingestion at a large scale, where we need to construct, load, save, and search graphs, which becomes increasingly costly. There are a few use cases where customers prefer to reduce the memory footprint by trading off a minimal loss in recall.

Using the Lucene ByteVector feature, we can add support for byte-sized vectors, where each dimension of the vector is a byte integer in the range [-128, 127].

How to convert a float value to a byte?

Quantization is the process of mapping continuous infinite values to a smaller set of discrete finite values. In the context of simulation and embedded computing, it is about approximating real-world values with a digital representation that introduces limits on the precision and range of a value.

We can make use of quantization techniques to convert float values (32 bits) to bytes (8 bits) without losing much precision. There are many quantization techniques, such as scalar quantization, product quantization (PQ, used in the faiss engine), etc.
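As a minimal illustration (not part of the plugin; the function name and the fixed [lo, hi] range are our assumptions), a linear scalar quantizer that maps a known float range onto the 256 signed-byte buckets could look like this:

```python
import numpy as np

def scalar_quantize(values, lo, hi):
    """Map floats in [lo, hi] linearly onto the signed-byte buckets [-128, 127]."""
    clipped = np.clip(values, lo, hi)                 # guard against out-of-range values
    normalized = (clipped - lo) / (hi - lo)           # scale into [0, 1]
    return (np.floor(normalized * 255) - 128).astype(np.int8)

floats = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
print(scalar_quantize(floats, lo=-1.0, hi=1.0))  # every value lands in [-128, 127]
```

The choice of rounding and bucket count is up to the user; this is only a sketch of the general idea.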

As a P0, we are not adding support for any quantization technique, because the technique to use depends on the customer's use case. Based on customer requests and usage, we will add support for quantization techniques later.

Proposed Solution


Initially, since we are not planning to support any quantization technique in our source code, the expectation is that customers provide quantized vectors as input, i.e. byte integers within the range [-128, 127], for both indexing and querying. To let users ingest these vectors using the KnnByteVectorField, we will introduce a new optional field, data_type, in the index mapping. There are no changes to the query mapping.

Examples of creating an index, ingesting documents, and querying byte-sized vectors are shown below:

Creating Index with data_type as byte

PUT test-index
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
        "my_vector1": {
          "type": "knn_vector",
          "dimension": 3,
          "data_type": "byte",
          "method": {
            "name": "hnsw",
            "space_type": "l2",
            "engine": "lucene",
            "parameters": {
              "ef_construction": 128,
              "m": 24
            }
          }
        }
    }
  }
}

Ingest Documents

PUT test-index/_doc/1
{
  "my_vector1": [-126, 28, 127]
}

PUT test-index/_doc/2
{
  "my_vector1": [100, -128, 0]
}

Search Query

GET test-index/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector1": {
        "vector": [26, -120, 99],
        "k": 2
      }
    }
  }
}
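Since the plugin expects the input vectors to already be quantized, a client might validate them before ingestion. A hypothetical helper (not part of the plugin) that enforces the byte range could look like:

```python
def validate_byte_vector(vector):
    """Check that a vector contains only integers in the signed-byte range [-128, 127]."""
    for v in vector:
        if not isinstance(v, int) or not -128 <= v <= 127:
            raise ValueError(f"value {v!r} is not a byte integer in [-128, 127]")
    return vector

validate_byte_vector([-126, 28, 127])  # ok, returns the vector unchanged
```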

Also, in approximate search, byte-sized vectors are supported only for the lucene engine; they are not supported for the nmslib and faiss engines.

Benchmarking on POC

Setup Configuration

Implemented a POC using the Lucene ByteVectorField and ran benchmarks against various datasets. The cluster configuration and index mapping are shown in the table below.

| Setting | Value |
| --- | --- |
| Instance Type | r5.16xlarge |
| Nodes | 1 |
| Primary Shards | 24 |
| Replicas | 1 |
| Disk Space | 1000 GB |
| ef_search | 256 |
| ef_construction | 256 |
| m | 16 |
| engine | lucene |
| method | hnsw |

Quantization and Normalization

Min-Max Normalization - This technique performs a linear transformation on the original data which scales the values of a feature to a range between 0 and 1. This is done by subtracting the minimum value of the feature from each value, and then dividing by the range of the feature.

Scalar Quantization - Splits the entire space of each dimension into discrete bins in order to reduce the overall memory footprint of the vector data.

Quantization Technique A: For these benchmarking tests, all the datasets used have float-valued vectors, so we normalized them using min-max normalization to scale the values into the range [0, 1], and then quantized the normalized values into 256 buckets (ranging from -128 to 127).

# Min-Max Normalization with Scalar Quantization for the L2 space type
import numpy as np

# Random dataset (example of creating a random dataset)
dataset = np.random.uniform(-300, 300, (100, 10))
# Random query set (example of creating a random query set)
queryset = np.random.uniform(-350, 350, (100, 10))
# Number of buckets
B = 256

# INDEXING:
# Get min and max
dataset_min = np.min(dataset)
dataset_max = np.max(dataset)
# Shift coordinates to be non-negative
dataset -= dataset_min
# Normalize into [0, 1]
dataset *= 1. / (dataset_max - dataset_min)
# Bucket into 256 values
dataset = np.floor(dataset * (B - 1)) - int(B / 2)

# QUERYING:
# Clip (if the query set range is outside the dataset range)
queryset = queryset.clip(dataset_min, dataset_max)
# Shift coordinates to be non-negative
queryset -= dataset_min
# Normalize into [0, 1]
queryset *= 1. / (dataset_max - dataset_min)
# Bucket into 256 values
queryset = np.floor(queryset * (B - 1)) - int(B / 2)

Quantization Technique B: Euclidean distance is shift invariant, which means ||x-y|| = ||(x-z)-(y-z)|| (if we shift both x and y by the same z, the distance remains the same). But cosine similarity is not: cosine(x, y) does not equal cosine(x-z, y-z).
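This property is easy to verify numerically (a quick check of our own, not from the RFC):

```python
import numpy as np

def l2(a, b):
    return np.linalg.norm(a - b)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, -1.0])
z = np.array([5.0, 5.0, 5.0])  # arbitrary shift applied to both vectors

print(np.isclose(l2(x, y), l2(x - z, y - z)))          # True: L2 is shift invariant
print(np.isclose(cosine(x, y), cosine(x - z, y - z)))  # False: cosine is not
```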

So, for the angular datasets, to avoid shifting, we follow a different approach that quantizes positive and negative values separately (pseudocode shown below for the glove-200-angular dataset). This technique makes a huge difference in recall for glove-200, improving it from 0.17 (with QT A) to 0.77 (with QT B).

// Min-Max Normalization with Scalar Quantization for cosine similarity
// For positive numbers

max = 2.7219
min = 0
B = 127

val = (val - min) / (max - min)
val *= B
int_part = floor(val)
frac_part = val - int_part
if 0.5 < frac_part:
    bval = int_part + 1
else:
    bval = int_part
return Byte(bval)
// Min-Max Normalization with Scalar Quantization for cosine similarity
// For negative numbers

old_max = 0
old_min = -6.6393
B = 128

min = 0
max = -old_min
val = (val - min) / (max - min)
val = (val * B)
int_part = floor(val)
frac_part = val - int_part
if 0.5 < frac_part:
    bval = int_part + 1
else:
    bval = int_part
return Byte(bval)
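The two pseudocode paths can be combined into a runnable sketch. The constants below are the glove-200 ranges quoted in this RFC; the function name and the branch structure are ours:

```python
import math

POS_MAX = 2.7219   # observed max of the positive values (glove-200 query set)
NEG_MIN = -6.6393  # observed min of the negative values (glove-200 query set)

def quantize_qt_b(val):
    """Quantize positive and negative values separately, rounding to the nearest bucket."""
    if val >= 0:
        scaled = (val - 0.0) / (POS_MAX - 0.0) * 127   # positives -> [0, 127]
    else:
        scaled = (val - 0.0) / (-NEG_MIN - 0.0) * 128  # negatives -> [-128, 0)
    int_part = math.floor(scaled)
    frac_part = scaled - int_part
    return int_part + 1 if frac_part > 0.5 else int_part

print(quantize_qt_b(2.7219))   # 127
print(quantize_qt_b(0.0))      # 0
print(quantize_qt_b(-6.6393))  # -128
```

Because no shift is applied, the sign of each component is preserved, which is what keeps the cosine similarity between quantized vectors close to the original.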

Benchmarking Results Comparison

| S. No. | Dataset | Dimension | Train data size | Query data size | Train data range | Query data range | Space type | Recall (float) | Recall (byte) | Index size (float) | Index size (byte) | % reduction in storage | Quantization Technique |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | gist-960-euclidean | 960 | 1,000,000 | 1,000 | [0.0, 1.4772] | [0.0, 0.7285] | L2 | 0.99703 | 0.96305 | 16.5 gb | 11.1 gb | 32.727 % | QT A |
| 2 | mnist-784-euclidean | 784 | 60,000 | 10,000 | [0.0, 255.0] | [0.0, 255.0] | L2 | 0.99997 | 0.99997 | 393.2 mb | 124 mb | 68.46 % | QT A |
| 3 | fashion-mnist-784-euclidean | 784 | 60,000 | 10,000 | [0.0, 255.0] | [0.0, 255.0] | L2 | 0.99977 | 0.99957 | 440.3 mb | 171.1 mb | 61.14 % | QT A |
| 4 | sift-128-euclidean | 128 | 1,000,000 | 10,000 | [0.0, 218.0] | [0.0, 184.0] | L2 | 0.99988 | 0.98617 | 1.3 gb | 624.5 mb | 53.087 % | QT A |
| 5 | nytimes-256-angular | 256 | 290,000 | 10,000 | [-0.34186783, 0.3459612] | [-0.29992762, 0.3025554] | cosinesimil | 0.98988 | 0.93418 | 1.6 gb | 1.2 gb | 25 % | QT A |
| 6 | glove-50-angular | 50 | 1,183,514 | 10,000 | [-6.9739, 6.5134] | [-6.2462, 5.8544] | cosinesimil | 0.99939 | 0.87822 | 1.3 gb | 1 gb | 23.08 % | QT A |
| 7 | glove-100-angular | 100 | 1,183,514 | 10,000 | [-6.5254, 5.9609] | [-6.1499, 4.2213] | cosinesimil | 0.99684 | 0.92377 | 2.6 gb | 1.9 gb | 26.923 % | QT B |
| 8 | glove-200-angular | 200 | 1,183,514 | 10,000 | [-6.7986, 4.609] | [-6.6393, 2.7219] | cosinesimil | 0.94615 | 0.77275 | 5.1 gb | 3.7 gb | 27.45 % | QT B |

Observations

Ran a test using 1 primary shard and zero replicas. After force-merging all the segments into one segment, the segment files and their corresponding sizes are listed below. The storage space occupied by each segment file is the same for both float vectors and byte vectors, except for the .vec file: the byte vectors (113 mb) occupy about 1/4 of the size of the float vectors (452 mb), which is what we expect. But we still do not see the expected overall reduction because of the .fdt file, which consumes 1.7 gb for both data types; this is nothing but the source data.

// glove-100-angular dataset

// Using Floats
total 2.3G
-rw-r--r--. 1 ec2-user ec2-user 1.4K May 28 04:16 _1c.fdm
-rw-r--r--. 1 ec2-user ec2-user 1.7G May 28 04:16 _1c.fdt
-rw-r--r--. 1 ec2-user ec2-user  67K May 28 04:16 _1c.fdx
-rw-r--r--. 1 ec2-user ec2-user  603 May 28 05:54 _1c.fnm
-rw-r--r--. 1 ec2-user ec2-user 1.3M May 28 04:16 _1c.kdd
-rw-r--r--. 1 ec2-user ec2-user  11K May 28 04:16 _1c.kdi
-rw-r--r--. 1 ec2-user ec2-user  149 May 28 04:16 _1c.kdm
-rw-r--r--. 1 ec2-user ec2-user  670 May 28 05:54 _1c.si
-rw-r--r--. 1 ec2-user ec2-user   77 May 28 04:16 _1c_Lucene90_0.doc
-rw-r--r--. 1 ec2-user ec2-user 2.4M May 28 04:16 _1c_Lucene90_0.dvd
-rw-r--r--. 1 ec2-user ec2-user  312 May 28 04:16 _1c_Lucene90_0.dvm
-rw-r--r--. 1 ec2-user ec2-user 3.8M May 28 04:16 _1c_Lucene90_0.tim
-rw-r--r--. 1 ec2-user ec2-user 162K May 28 04:16 _1c_Lucene90_0.tip
-rw-r--r--. 1 ec2-user ec2-user  200 May 28 04:16 _1c_Lucene90_0.tmd
-rw-r--r--. 1 ec2-user ec2-user 452M May 28 05:54 _1c_Lucene95HnswVectorsFormat_0.vec
-rw-r--r--. 1 ec2-user ec2-user  81K May 28 05:54 _1c_Lucene95HnswVectorsFormat_0.vem
-rw-r--r--. 1 ec2-user ec2-user  94M May 28 05:54 _1c_Lucene95HnswVectorsFormat_0.vex
-rw-r--r--. 1 ec2-user ec2-user  375 May 28 05:54 segments_8
-rw-r--r--. 1 ec2-user ec2-user    0 May 26 22:48 write.lock

// Using Bytes
total 1.9G
-rw-r--r--. 1 ec2-user docker 1.4K May 28 04:16 _18.fdm
-rw-r--r--. 1 ec2-user docker 1.7G May 28 04:16 _18.fdt
-rw-r--r--. 1 ec2-user docker  67K May 28 04:16 _18.fdx
-rw-r--r--. 1 ec2-user docker  603 May 28 05:23 _18.fnm
-rw-r--r--. 1 ec2-user docker 1.3M May 28 04:16 _18.kdd
-rw-r--r--. 1 ec2-user docker  11K May 28 04:16 _18.kdi
-rw-r--r--. 1 ec2-user docker  149 May 28 04:16 _18.kdm
-rw-r--r--. 1 ec2-user docker  670 May 28 05:23 _18.si
-rw-r--r--. 1 ec2-user docker   77 May 28 04:16 _18_Lucene90_0.doc
-rw-r--r--. 1 ec2-user docker 2.4M May 28 04:16 _18_Lucene90_0.dvd
-rw-r--r--. 1 ec2-user docker  312 May 28 04:16 _18_Lucene90_0.dvm
-rw-r--r--. 1 ec2-user docker 3.8M May 28 04:16 _18_Lucene90_0.tim
-rw-r--r--. 1 ec2-user docker 162K May 28 04:16 _18_Lucene90_0.tip
-rw-r--r--. 1 ec2-user docker  200 May 28 04:16 _18_Lucene90_0.tmd
-rw-r--r--. 1 ec2-user docker 113M May 28 05:23 _18_Lucene95HnswVectorsFormat_0.vec
-rw-r--r--. 1 ec2-user docker  81K May 28 05:23 _18_Lucene95HnswVectorsFormat_0.vem
-rw-r--r--. 1 ec2-user docker  94M May 28 05:23 _18_Lucene95HnswVectorsFormat_0.vex
-rw-r--r--. 1 ec2-user docker  375 May 28 05:23 segments_8
-rw-r--r--. 1 ec2-user docker    0 May 26 22:07 write.lock
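As a sanity check (our own arithmetic, not from the RFC), the .vec file sizes above line up with vectors x dimension x bytes-per-dimension for the glove-100-angular dataset:

```python
num_vectors = 1_183_514   # glove-100-angular train data size
dimension = 100

float_vec_mb = num_vectors * dimension * 4 / (1024 ** 2)  # 4 bytes per float dimension
byte_vec_mb = num_vectors * dimension * 1 / (1024 ** 2)   # 1 byte per dimension

print(round(float_vec_mb))  # ~451, matching the 452M .vec file for floats
print(round(byte_vec_mb))   # ~113, matching the 113M .vec file for bytes
```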

Feedback

Please provide your feedback; any questions about the feature are welcome.

jmazanec15 commented 1 year ago

@naveentatikonda This is a cool proposal. With this interface, I am wondering a couple of things:

  1. We have received a lot of interest in supporting binary space types like hamming distance (ref: https://github.com/opensearch-project/k-NN/issues/81, https://github.com/opensearch-project/k-NN/issues/949). For this use case, how would you see the interface changing to support this?
  2. There is some interest in supporting unsigned 16 bit floats, how would this look with this interface?
naveentatikonda commented 1 year ago

> @naveentatikonda This is a cool proposal. With this interface, I am wondering a couple of things:
>
>   1. We have received a lot of interest in supporting binary space types like hamming distance (ref: [FEATURE] Hamming distance / binary vector support #81, [FEATURE] Support for approximate search using hamming distance #949). For this use case, how would you see the interface changing to support this?
>   2. There is some interest in supporting unsigned 16 bit floats, how would this look with this interface?

@jmazanec15 Thanks for your questions.

  1. From the VectorDataType point of view, since we are using an enum for the vector data_type, I feel we could add a new constant such as binary to the enum if we add support for Hamming distance.
  2. Similarly, for 16-bit floats, we could expect users to provide float_16 as input for 16-bit floats and float or float_32 for 32-bit floats. Also, in our enum we could add constants such as FLOAT(4) for 32-bit floats and FLOAT(2) for 16-bit floats.
jmazanec15 commented 1 year ago

Makes sense. I think calling it byte may be too generic. What if we call it int8, like the C++ typedef? It is being treated as an 8-bit integer, not a binary value. Float can probably remain float (not float32). This is also something we can change in the future, after the release.

jmazanec15 commented 1 year ago

Actually, on second thought, byte is consistent with OpenSearch so I am okay: https://opensearch.org/docs/latest/field-types/supported-field-types/numeric/