dhmw closed this issue 1 week ago
Thanks for creating the GH issue. This will be a valuable enhancement to the existing script-score-based search for embeddings.
The issue is that index access on doc values is supposed to select among a field's values (each one a float[] here), not necessarily among the elements of a single float[]. The distinction matters when a doc has multiple values for a field. See https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/fielddata/ScriptDocValues.java#L309. Right now, I believe knn_vector does not support multiple values per field.
There actually is a way to get a per-dimension value: call getValue() on the doc value and index into the returned array. To do this, you can run:
curl -X PUT "localhost:9200/target_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "target_field": {
        "type": "knn_vector",
        "dimension": 2
      }
    }
  }
}
'
curl -X PUT "localhost:9200/_bulk" -H 'Content-Type: application/json' -d'
{ "index": { "_index": "target_index" } }
{ "target_field": [1.5, 5.5]}
{ "index": { "_index": "target_index" } }
{ "target_field": [0.5, 5.5]}
'
curl localhost:9200/_refresh
curl -XGET "http://localhost:9200/target_index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 2,
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "doc[params.field].getValue()[0]",
        "params": {
          "field": "target_field"
        }
      }
    }
  }
}
'
...
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.5,
    "hits" : [
      {
        "_index" : "target_index",
        "_id" : "RGtgA5MBjntBBRdo-f-Y",
        "_score" : 1.5,
        "_source" : {
          "target_field" : [
            1.5,
            5.5
          ]
        }
      },
      {
        "_index" : "target_index",
        "_id" : "RWtgA5MBjntBBRdo-f-Y",
        "_score" : 0.5,
        "_source" : {
          "target_field" : [
            0.5,
            5.5
          ]
        }
      }
    ]
  }
}
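As a rough sketch of how the same getValue() access might drive a minimum-element filter, the request body below scores each document by one vector element and relies on the script_score query's min_score parameter to drop documents under a threshold. This is an assumption-laden illustration rather than a confirmed k-NN feature: the helper name min_element_query and its arguments are hypothetical, and it assumes the chosen element is never negative, since script_score rejects negative scores.

```shell
# Hypothetical helper (not part of OpenSearch or the k-NN plugin).
# Prints a script_score request body that scores each document by
# element $2 of knn_vector field $1 and keeps only documents whose
# score, i.e. that element, is at least $3.
# Assumption: the element is non-negative (negative scores are rejected).
min_element_query() {
  field="$1"
  idx="$2"
  min="$3"
  cat <<EOF
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "doc[params.field].getValue()[${idx}]",
        "params": { "field": "${field}" }
      },
      "min_score": ${min}
    }
  }
}
EOF
}

# Usage against the example index above (assumes a local cluster on port 9200):
# min_element_query target_field 0 1.0 |
#   curl -XGET "http://localhost:9200/target_index/_search?pretty" \
#     -H 'Content-Type: application/json' -d @-
```

With the two sample documents above, a threshold of 1.0 on element 0 would be expected to keep [1.5, 5.5] and drop [0.5, 5.5], though this has not been run against a cluster here. The element index is written into the script source so that the expression has the same shape as the getValue()[0] call shown earlier.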
Is your feature request related to a problem?
Vector element access is not implemented here: https://github.com/opensearch-project/k-NN/blob/eb0a3c7454cb33346f135161beb21f46f43b8457/src/main/java/org/opensearch/knn/index/KNNVectorScriptDocValues.java#L70
Is there a good reason for this?
What solution would you like?
It would be nice to be able to access individual vector values in scripts, for example to pre-filter documents that do not meet a minimum condition on a vector element. This is useful when the vector represents a classification of features and we want to exclude documents from kNN search when a given feature is present.
e.g.
What alternatives have you considered?
Currently, we would have to scan and index additional fields on our documents and use a regular field range query, but this requires storing redundant information in the document.