opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
157 stars 116 forks source link

[FEATURE] Move Lucene Vector field and HNSW KNN Search as a first class feature in core #1467

Open nknize opened 9 months ago

nknize commented 9 months ago

Is your feature request related to a problem? Core OpenSearch does not support Vector types as a first class field. The correlation engine has a CorrelationVectorFieldMapper that uses Lucene's KNNFloatVectorField but this is in the events-correlation-engine plugin. We could move that field mapper to the core library, but we don't want to fragment between different vector field implementations. So why not move the Lucene HNSW backed vector field and Knn search as a first class field in a core library?

What solution would you like? A discussion around making vector field type as a first class citizen in core. We've discussed this before in "person" but I don't know if we have an issue around it. I don't think there's a reason to not have Lucene vector fields and HNSW backed KNN search as a core feature and leverage the OpenSearch kNN plugin as an optional accelerator using alternative native options like FAISS or nmslib?

What alternatives have you considered? Leave as is if there is a compelling reason to keep this base Lucene capability integration in a separate downstream plugin.

Do you have any additional context? Add any other context or screenshots about the feature request here.

nknize commented 9 months ago

If we only move lucene vector field to core, there would be two different APIs: one for lucene engine and another for native engines.

Why?

heemin32 commented 9 months ago

If we only move lucene vector field to core, there would be two different APIs: one for lucene engine and another for native engines.

Why?

I thought the knn plugin API applies to both lucene and native engine but they are only for native engines. So, the previous comment is not correct. If lucene ever support model training feature by introducing new algorithm like ivf in the future, then there will be two different APIs.

nknize commented 9 months ago

I thought the knn plugin API applies to both lucene and native engine but they are only for native engines.

++ sounds good. So then I propose a vector field type in core and have the k-NN plugin extend the FieldMapper to add the native field type and queries needed to implement fiass, nmslib, others...

nknize commented 9 months ago

If there's no opposition then I'll move this issue to core because it'll essentially be a no-op for kNN. kNN can decide in the future whether to simplify the vector field type API for native implementations.

heemin32 commented 9 months ago

The index setting and mapping is shared between lucene engine and native engine currently. The engine parameter determine if it is lucene engine or native engine. I guess it might be possible to keep the current format after moving the lucene engine into core but not 100% sure. @navneet1v @jmazanec15?

PUT my-knn-index-1
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
        "my_vector1": {
          "type": "knn_vector",
          "dimension": 2,
          "method": {
            "name": "hnsw",
            "space_type": "l2",
            "engine": "lucene",
            "parameters": {
              "ef_construction": 128,
              "m": 24
            }
          }
        },
        "my_vector2": {
          "type": "knn_vector",
          "dimension": 4,
          "method": {
            "name": "hnsw",
            "space_type": "innerproduct",
            "engine": "faiss",
            "parameters": {
              "ef_construction": 256,
              "m": 48
            }
          }
        }
    }
  }
}
nknize commented 9 months ago

Not sure we can keep the same API format after moving the lucene engine into core.

Sure you can. FieldMappers implementations are independent of the underlying context.doc().add() wrapped lucene API and termQuery, existQuery, fielddataBuilder (etc.) implementations. You may want a deprecation path for BWC, of course (although looking at your example one may not be needed, just deprecate knn_vector type in favor of a new more simple vector type). I'd suggest refactoring VectorFieldMapper from the events-correlation-engine to the core mapper package (do it right though and put it in a mapper library or in the core library so it's subject to proper JPMS guardrails) and abstract it as a base. Then have the CorrelationVectorFieldMapper, KNNVectorFieldMapper, ModelFieldMapper, LuceneFieldMapper derive from the base VectorFieldMapper and implement their own concrete Parser, Builder, and MappedFieldType logic. Have the VectorFieldMapper parser delegate to the appropriate Builder based on a simple DSL parameter. You already have it as engine in your existing API, it looks like.

Also, it looks like I misunderstood your comment, I thought you said this plugin didn't end up implementing Lucene's HNSW, but it looks like it does through the LuceneFieldMapper. Although you might already know that KnnVectorField is deprecated and replaced by KnnFloatVectorField.

nknize commented 9 months ago

As a reference, this is similar to how I supported QuadTree, GeoHashPrefixTree, and BKD implementations of geo_point and geo_shape field mappers and extended them in Elastic XPack to support cartesian geometries without forcing a confusing DSL surface area on the end user. In this case, all I'm really suggesting is that you refactor KNNVectorFieldMapper and LuceneFieldMapper from this plugin into the core module (you could put them in the mapper-extras module) and haev the native backed field mappers in this plugin derive from that upstream KNNVectorFieldMapper. As an extra step... merge or remove the VectorFieldMapper class in the events-correlation-engine and have that correlation engine implement its own correlation_vector in the correlation-engine plugin.

It's just moving around the implementation to promote Lucene's HNSW vector and KNN search as a first class core field type and feature.

navneet1v commented 9 months ago

@nknize Having a vector field as a first class citizen in Opensearch is quite exciting. But I have few questions around this:

  1. What is the motivation to move the vector field as a first class citizen in Opensearch core repo?
  2. If we want to move the Lucene implementation in core then its not just the FieldMapper that needs to be moved. We need to move the QueryBuilders, Codecs, vector related index settings, distance computation functionalities too. If we just move the Mapper to core I am not sure what we will achieve as users won't be able to do the search without right query builders.
  3. The movement should be backward compatible otherwise we can loose trust from community given that vector search is so popular now a days. Re-indexing of Vectors index is far slower and costlier than a normal text index.
nknize commented 9 months ago

What is the motivation to move the vector field as a first class citizen in Opensearch core repo?

There are users of OpenSearch min distribution that need vector types and cannot use any plugins.

If we want to move the Lucene implementation in core then its not just the FieldMapper...

Correct. Is this a problem?

The movement should be backward compatible...

See my comment above: "You may want a deprecation path for BWC, of course". This isn't difficult to do.

My question here is whether there is interest or not. If not, then I may end up just adding a new field mapper to core anyway that exposes Lucene's KnnFloatVectorField and KnnByteVectorField for the users that fall in the min distribution category. I just want to make sure that's discussed here first because it will likely cause confusion w/ the kNN plugin's Lucene implementation.

navneet1v commented 9 months ago

My question here is whether there is interest or not.

I would say interest is there atleast from my side.

If not, then I may end up just adding a new field mapper to core anyway that exposes Lucene's KnnFloatVectorField and KnnByteVectorField for the users that fall in the min distribution category.

If this is done then for sure there will be big confusion and won't be good for the Opensearch as a product.

Correct. Is this a problem?

I don't see this as a problem the only thing for me is backward compatibility. It should not be like customer needs to migrate their workloads because of the above mentioned problems.

nknize commented 9 months ago

I agree 💯

jmazanec15 commented 9 months ago

If not, then I may end up just adding a new field mapper to core anyway that exposes Lucene's KnnFloatVectorField and KnnByteVectorField for the users that fall in the min distribution category.

If this is done then for sure there will be big confusion and won't be good for the Opensearch as a product.

+1 - would like to avoid this.

Overall, I think this makes sense. Moving base vector functionality to core has the following pros that I can think of

  1. Min distro user's can use vector search.
  2. Better isolate platform dependent components will make development/usage easier (no c++ necessary to use k-NN)
  3. Easier for others to build features on top of vector type. Right now, they have to add a dependency on the k-NN plugin (neural-search). Or, they have to basically duplicate all of the knn field and query functionality (correlation vector type refer to https://github.com/opensearch-project/OpenSearch/issues/6779#issuecomment-1609884193)
  4. First class support can make it easier to support features like dynamic mapping or integrate into clients

Also, philosophically, I think that core should keep field types and query types for fundamental functionality. For instance, we wouldnt have a plugin that implements numeric types (maybe an over-exageration). I think vectors have definitely exited the niche type and moved towards a pretty common type.

That being said, Im not sure on introducing a new field type vector as opposed to just moving knn_vector, but this could probably be discussed in diff issue.

vamshin commented 9 months ago

+1 to Jack's comment above.

Its high time we see vector data type as first class citizen(part of core) for the OpenSearch similar to what Lucene has done. I would support moving knn_vector field type to core rather than creating another field type mimicking the same behavior. Anyways this is going to be a big lift and shift work and involves rearchitecting some of the pieces. It will be good to make it part of a major version release.

Do we actually see backward compatibility issue? I think NO. Since core is a min distribution, moving something to core should not break customer experience. It should be vice versa i.e moving something away from core to plugin. Am i missing something here?

nknize commented 9 months ago

Do we actually see backward compatibility issue? I think NO. Since core is a min distribution, moving something to core should not break customer experience.

For the most part you're not missing anything. I would suggest renaming the mapped field type from knn_vector to just vector. But that's a simple deprecation logger and version check.

vamshin commented 9 months ago

@nknize I would prefer just keeping the field name same knn_vector to avoid any confusion and keep it smooth migration. But open to suggestions.