Open nknize opened 9 months ago
If we only move lucene vector field to core, there would be two different APIs: one for lucene engine and another for native engines.
Why?
If we only move lucene vector field to core, there would be two different APIs: one for lucene engine and another for native engines.
Why?
I thought the knn plugin API applies to both lucene and native engine but they are only for native engines. So, the previous comment is not correct. If lucene ever support model training feature by introducing new algorithm like ivf in the future, then there will be two different APIs.
I thought the knn plugin API applies to both lucene and native engine but they are only for native engines.
++ sounds good. So then I propose a vector
field type in core and have the k-NN plugin extend the FieldMapper to add the native field type and queries needed to implement fiass, nmslib, others...
If there's no opposition then I'll move this issue to core because it'll essentially be a no-op for kNN. kNN can decide in the future whether to simplify the vector field type API for native implementations.
The index setting and mapping is shared between lucene engine and native engine currently. The engine
parameter determine if it is lucene engine or native engine. I guess it might be possible to keep the current format after moving the lucene engine into core but not 100% sure. @navneet1v @jmazanec15?
PUT my-knn-index-1
{
"settings": {
"index": {
"knn": true,
"knn.algo_param.ef_search": 100
}
},
"mappings": {
"properties": {
"my_vector1": {
"type": "knn_vector",
"dimension": 2,
"method": {
"name": "hnsw",
"space_type": "l2",
"engine": "lucene",
"parameters": {
"ef_construction": 128,
"m": 24
}
}
},
"my_vector2": {
"type": "knn_vector",
"dimension": 4,
"method": {
"name": "hnsw",
"space_type": "innerproduct",
"engine": "faiss",
"parameters": {
"ef_construction": 256,
"m": 48
}
}
}
}
}
}
Not sure we can keep the same API format after moving the lucene engine into core.
Sure you can. FieldMappers implementations are independent of the underlying context.doc().add()
wrapped lucene API and termQuery
, existQuery
, fielddataBuilder
(etc.) implementations. You may want a deprecation path for BWC, of course (although looking at your example one may not be needed, just deprecate knn_vector
type in favor of a new more simple vector
type). I'd suggest refactoring VectorFieldMapper
from the events-correlation-engine
to the core mapper package (do it right though and put it in a mapper library or in the core library so it's subject to proper JPMS guardrails) and abstract it as a base. Then have the CorrelationVectorFieldMapper
, KNNVectorFieldMapper
, ModelFieldMapper
, LuceneFieldMapper
derive from the base VectorFieldMapper
and implement their own concrete Parser, Builder, and MappedFieldType logic. Have the VectorFieldMapper
parser delegate to the appropriate Builder based on a simple DSL parameter. You already have it as engine
in your existing API, it looks like.
Also, it looks like I misunderstood your comment, I thought you said this plugin didn't end up implementing Lucene's HNSW, but it looks like it does through the LuceneFieldMapper
. Although you might already know that KnnVectorField
is deprecated and replaced by KnnFloatVectorField
.
As a reference, this is similar to how I supported QuadTree, GeoHashPrefixTree, and BKD implementations of geo_point
and geo_shape
field mappers and extended them in Elastic XPack to support cartesian geometries without forcing a confusing DSL surface area on the end user. In this case, all I'm really suggesting is that you refactor KNNVectorFieldMapper
and LuceneFieldMapper
from this plugin into the core module (you could put them in the mapper-extras
module) and haev the native backed field mappers in this plugin derive from that upstream KNNVectorFieldMapper
. As an extra step... merge or remove the VectorFieldMapper
class in the events-correlation-engine
and have that correlation engine implement its own correlation_vector
in the correlation-engine plugin.
It's just moving around the implementation to promote Lucene's HNSW vector and KNN search as a first class core field type and feature.
@nknize Having a vector field as a first class citizen in Opensearch is quite exciting. But I have few questions around this:
What is the motivation to move the vector field as a first class citizen in Opensearch core repo?
There are users of OpenSearch min distribution that need vector types and cannot use any plugins.
If we want to move the Lucene implementation in core then its not just the FieldMapper...
Correct. Is this a problem?
The movement should be backward compatible...
See my comment above: "You may want a deprecation path for BWC, of course". This isn't difficult to do.
My question here is whether there is interest or not. If not, then I may end up just adding a new field mapper to core anyway that exposes Lucene's KnnFloatVectorField
and KnnByteVectorField
for the users that fall in the min distribution category. I just want to make sure that's discussed here first because it will likely cause confusion w/ the kNN plugin's Lucene implementation.
My question here is whether there is interest or not.
I would say interest is there atleast from my side.
If not, then I may end up just adding a new field mapper to core anyway that exposes Lucene's KnnFloatVectorField and KnnByteVectorField for the users that fall in the min distribution category.
If this is done then for sure there will be big confusion and won't be good for the Opensearch as a product.
Correct. Is this a problem?
I don't see this as a problem the only thing for me is backward compatibility. It should not be like customer needs to migrate their workloads because of the above mentioned problems.
I agree 💯
If not, then I may end up just adding a new field mapper to core anyway that exposes Lucene's KnnFloatVectorField and KnnByteVectorField for the users that fall in the min distribution category.
If this is done then for sure there will be big confusion and won't be good for the Opensearch as a product.
+1 - would like to avoid this.
Overall, I think this makes sense. Moving base vector functionality to core has the following pros that I can think of
Also, philosophically, I think that core should keep field types and query types for fundamental functionality. For instance, we wouldnt have a plugin that implements numeric types (maybe an over-exageration). I think vectors have definitely exited the niche type and moved towards a pretty common type.
That being said, Im not sure on introducing a new field type vector
as opposed to just moving knn_vector
, but this could probably be discussed in diff issue.
+1 to Jack's comment above.
Its high time we see vector data type as first class citizen(part of core) for the OpenSearch similar to what Lucene has done. I would support moving knn_vector field type to core rather than creating another field type mimicking the same behavior. Anyways this is going to be a big lift and shift work and involves rearchitecting some of the pieces. It will be good to make it part of a major version release.
Do we actually see backward compatibility issue? I think NO. Since core is a min distribution, moving something to core should not break customer experience. It should be vice versa i.e moving something away from core to plugin. Am i missing something here?
Do we actually see backward compatibility issue? I think NO. Since core is a min distribution, moving something to core should not break customer experience.
For the most part you're not missing anything. I would suggest renaming the mapped field type from knn_vector
to just vector
. But that's a simple deprecation logger and version check.
@nknize I would prefer just keeping the field name same knn_vector
to avoid any confusion and keep it smooth migration. But open to suggestions.
Is your feature request related to a problem? Core OpenSearch does not support Vector types as a first class field. The correlation engine has a
CorrelationVectorFieldMapper
that uses Lucene'sKNNFloatVectorField
but this is in theevents-correlation-engine
plugin. We could move that field mapper to the core library, but we don't want to fragment between different vector field implementations. So why not move the Lucene HNSW backed vector field and Knn search as a first class field in a core library?What solution would you like? A discussion around making
vector
field type as a first class citizen in core. We've discussed this before in "person" but I don't know if we have an issue around it. I don't think there's a reason to not have Lucene vector fields and HNSW backed KNN search as a core feature and leverage the OpenSearch kNN plugin as an optional accelerator using alternative native options like FAISS or nmslib?What alternatives have you considered? Leave as is if there is a compelling reason to keep this base Lucene capability integration in a separate downstream plugin.
Do you have any additional context? Add any other context or screenshots about the feature request here.