opensearch-project / neural-search

Plugin that adds dense neural retrieval into the OpenSearch ecosystem
Apache License 2.0

[FEATURE] Quantization processor in ingest pipeline #991

Open YeonghyeonKO opened 2 weeks ago

YeonghyeonKO commented 2 weeks ago

Is your feature request related to a problem?

After documents are ingested by the text_embedding processor, an array of float32 values per knn_vector field is stored in the segments (HNSW or IVF).

But when the index mapping declares the knn_vector field as a quantized byte or binary type, documents can no longer be embedded with the text_embedding ingest processor alone. The vectors have to be quantized before ingestion, which is inconvenient because the quantizer has to live somewhere outside the OpenSearch cluster.

And vice versa, query vectors must be converted from float to byte type when calculating similarities against the already quantized vector fields stored in the segments.
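For illustration, the kind of conversion that currently has to happen outside the cluster, and that a quantization ingest processor could absorb, is sketched below. The symmetric int8 scheme and the calibration constant are assumptions for the sketch, not an existing OpenSearch API:

```python
import numpy as np

def quantize_to_int8(vector: np.ndarray, max_abs: float) -> np.ndarray:
    """Map float32 values in [-max_abs, max_abs] onto the int8 range [-127, 127].

    max_abs is typically estimated from a sample of embeddings; the same value
    must be reused for query vectors so document and query spaces stay aligned.
    """
    scaled = np.clip(vector / max_abs, -1.0, 1.0) * 127.0
    return np.round(scaled).astype(np.int8)

# Quantize a document embedding and a query embedding with the same calibration.
doc_embedding = np.random.rand(384).astype(np.float32) * 2 - 1
query_embedding = np.random.rand(384).astype(np.float32) * 2 - 1
max_abs = 1.0  # assumed calibration constant for the sketch
doc_int8 = quantize_to_int8(doc_embedding, max_abs)
query_int8 = quantize_to_int8(query_embedding, max_abs)
```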

Is there any plan to support a so-called "quantization" processor for the "byte" type of knn_vector field?



(except Faiss's SQfp16 and Lucene's scalar quantization, which don't require quantizing the vectors before indexing)

At ingestion time, when you upload 32-bit floating-point vectors to OpenSearch, SQfp16 quantizes them into 16-bit floating-point vectors and stores the quantized vectors in a k-NN index. At search time, SQfp16 decodes the vector values back into 32-bit floating-point values for distance computation. (doc)

Starting with version 2.16, the k-NN plugin supports built-in scalar quantization for the Lucene engine. (doc)
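As a point of reference, built-in quantization like SQfp16 needs no client-side conversion; a minimal opensearch-py sketch of such a mapping, where the index name, field name, and dimension are placeholders:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# knn_vector field whose 32-bit floats are quantized to fp16 inside the Faiss engine;
# the text_embedding processor can keep emitting plain float32 arrays.
client.indices.create(
    index="sqfp16-demo",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "method": {
                        "name": "hnsw",
                        "engine": "faiss",
                        "space_type": "l2",
                        "parameters": {
                            "encoder": {"name": "sq", "parameters": {"type": "fp16"}}
                        },
                    },
                }
            }
        },
    },
)
```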

heemin32 commented 2 weeks ago

@YeonghyeonKO Thanks for the feature request. Is it possible to ingest byte or binary data vectors using an embedding model that supports byte or binary embeddings, or does this remain a challenge even when using the text_embedding processor?

YeonghyeonKO commented 2 weeks ago

@heemin32 Yes, documents can be easily indexed using proprietary int8/binary embedding models such as Cohere's, but that isn't suitable as a general-purpose solution since those models usually require a commercial license or some kind of API key subscription.

But suppose we're on a closed network: developers cannot deploy the pretrained models provided by OpenSearch directly (because a firewall blocks communication with artifacts.opensearch.org) unless the torch_script.zip has already been downloaded locally. Furthermore, connecting to externally hosted models is impossible in this case.

In this case, is exporting quantized models to TorchScript (a .pt file plus config files) the only option we can use? Of course, it would be great if OpenSearch supported more pretrained models that return quantized vectors (int8 or binary).
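For context, a locally exported TorchScript model can in principle be registered through ml-commons from an internal URL instead of artifacts.opensearch.org. A rough sketch with placeholder values; the exact fields should be checked against the ml-commons model register API docs:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Register a locally exported TorchScript embedding model served from an internal
# artifact store. Registering via URL may need the cluster setting
# plugins.ml_commons.allow_registering_model_via_url to be enabled.
register_body = {
    "name": "my-internal-embedding-model",          # placeholder name
    "version": "1.0.0",
    "model_format": "TORCH_SCRIPT",
    "model_content_hash_value": "<sha256-of-the-zip>",  # placeholder
    "model_config": {
        "model_type": "bert",
        "embedding_dimension": 384,
        "framework_type": "sentence_transformers",
    },
    "url": "https://internal.example.com/models/my-model-torch_script.zip",  # placeholder
}
response = client.transport.perform_request(
    "POST", "/_plugins/_ml/models/_register", body=register_body
)
print(response)
```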

heemin32 commented 2 weeks ago

@YeonghyeonKO Thanks for sharing your use case. First, please note that if the model is not trained with int8/binary embeddings in mind, the recall tends to be suboptimal, generally below 0.9.

Here are a few options to consider.

Options without recall degradation

  1. Disk-based vector search: This method performs quantization, oversampling, and rescoring automatically, offering high recall with low memory usage, even if the model isn't specifically trained for int8/binary embeddings (see the mapping sketch at the end of this comment). https://opensearch.org/docs/latest/search-plugins/knn/disk-based-vector-search/

  2. Pretrained Models for Int8/Binary Embeddings: You can propose support for pretrained models that generate int8/binary embeddings in the ml-common repository.

Options with recall degradation

  1. Int8/Binary Quantization in k-NN: Currently, the k-NN plugin supports fp16 quantization (which maintains high recall). Extending this to support int8/binary quantization could be another approach. https://opensearch.org/docs/latest/search-plugins/knn/knn-vector-quantization/#faiss-16-bit-scalar-quantization

  2. Ingest Processor for Int8/Binary Quantization: As outlined in this GitHub issue, implementing int8/binary quantization within the ingest processor is another option.

Since the processor can only handle a single vector value at a time, it is limited to very basic quantization. I recommend considering options 1 through 3 first and then checking whether option 4 (the ingest processor) is still what you want.
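To make option 1 above concrete, here is a minimal opensearch-py sketch of a disk-based (on_disk) mapping, assuming OpenSearch 2.17+; the index name, field name, and dimension are placeholders:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# mode=on_disk enables built-in quantization with automatic oversampling and
# rescoring; float32 vectors produced by text_embedding can be ingested as-is.
client.indices.create(
    index="disk-knn-demo",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "space_type": "l2",
                    "mode": "on_disk",
                }
            }
        },
    },
)
```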

YeonghyeonKO commented 1 week ago

@heemin32 Since both OpenSearch and Elasticsearch have recently been recommending binary quantization for no degradation of recall, I'll research it and optimize it for my use case. You wouldn't recommend keeping the compression_level parameter in mind with on_disk mode because of its probable recall degradation, right?

heemin32 commented 1 week ago

Could you share a link where they state there is no recall degradation? A compression_level of 32x in on_disk mode corresponds to binary quantization. However, in on_disk mode, oversampling and rescoring are performed automatically to minimize recall degradation.
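For concreteness, a rough sketch of the 32x case, assuming the compression_level mapping parameter and the rescore/oversample_factor query option described in the disk-based search docs (index and field names are placeholders):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# compression_level 32x on an on_disk field corresponds to binary quantization;
# oversampling and rescoring against full-precision vectors happen automatically.
client.indices.create(
    index="binary-qt-demo",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "space_type": "l2",
                    "mode": "on_disk",
                    "compression_level": "32x",
                }
            }
        },
    },
)

# A query-time override of the oversample factor used for rescoring.
client.search(
    index="binary-qt-demo",
    body={
        "size": 5,
        "query": {
            "knn": {
                "embedding": {
                    "vector": [0.1] * 384,  # placeholder query vector
                    "k": 5,
                    "rescore": {"oversample_factor": 3.0},
                }
            }
        },
    },
)
```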

YeonghyeonKO commented 1 week ago

Oh, sorry for the confusion. "Maintaining strong recall" is the correct expression, but it still needs oversampling to minimize recall degradation.

navneet1v commented 6 days ago

@YeonghyeonKO Having a quantization processor could be an interesting feature, but there is already a way to do this via ML Commons alone. Here is a tutorial for byte quantization: https://github.com/opensearch-project/ml-commons/blob/78a304a95d546db31139a135a1f031890a55ae72/docs/tutorials/semantic_search/semantic_search_with_byte_quantized_vector.md?plain=1#L51C14-L51C35

See if that helps in some way.

Look at the post_process_function of the ML connector.

YeonghyeonKO commented 6 days ago

@navneet1v This tutorial is truly helpful for developers who can freely use external APIs (Cohere or OpenAI), but not for those with security restrictions. Is there any other way to create our own connectors using OpenSearch's pretrained models?

navneet1v commented 6 days ago

> @navneet1v This tutorial is truly helpful for developers who can freely use external APIs (Cohere or OpenAI), but not for those with security restrictions. Is there any other way to create our own connectors using OpenSearch's pretrained models?

Yes, there is. The external model connection came later; earlier, models used to be deployed on the OpenSearch ML nodes.

What I wanted to put forward here is that there is a pre- and post-processing step that can be added for any model, and you can add that step for your local model. This should remove the need for the processor.

Let me tag some folks from the ML Commons plugin who can answer this question. As per my understanding, pre- and post-processing should be supported with local models too.

@ylwu Can you confirm that pre- and post-processors for a model are supported with local models too?