[FEATURE] Element access in KNNVectorScriptDocValues ?

dhmw commented 4 weeks ago

Is your feature request related to a problem?

The vector element access is not implemented here: https://github.com/opensearch-project/k-NN/blob/eb0a3c7454cb33346f135161beb21f46f43b8457/src/main/java/org/opensearch/knn/index/KNNVectorScriptDocValues.java#L70

Is there a good reason for this?

What solution would you like?

It would be nice to be able to access individual vector elements in scripts, for example to pre-filter out documents that do not meet a minimum vector element condition. This is useful when the vector represents a classification of features and we want to exclude documents from kNN search when a given feature is present.

e.g.


GET vectors-idx/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "knn": {
            "vectors.my-vector": {
              "vector": [ ... values ...  ],
              "k": 10
            }
          }
        }
      ],
      "filter": [
        {
          "script": {
            "script": {
              "source": "return doc['vectors.my-vector'][12] < 0.05"
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "_score": "desc"
    }
  ],
  "size": 10
}

What alternatives have you considered?

Currently, we would have to scan and index additional fields on our documents and use a regular field range query, but this requires storing redundant information in the document.

navneet1v commented 3 weeks ago

Thanks for creating the GH issue. This will be a valuable enhancement to the existing script-score-based search for embeddings.

jmazanec15 commented 2 weeks ago

https://github.com/opensearch-project/k-NN/blob/eb0a3c7454cb33346f135161beb21f46f43b8457/src/main/java/org/opensearch/knn/index/KNNVectorScriptDocValues.java#L70

The issue is that get(int) is supposed to return the value at that position in the field's list of values (a float[]), not an element of the vector (a float). The distinction matters when a doc has multiple values for a field. See https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/fielddata/ScriptDocValues.java#L309. Right now, I believe knn_vector does not support multi-valued fields.
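To illustrate the difference between indexing into the list of a field's values and indexing into a single vector, here is a minimal sketch. VectorDocValues is a hypothetical, simplified stand-in written for this example. It is not the real ScriptDocValues or KNNVectorScriptDocValues class, and it only mirrors the list-of-values shape described above.

```java
import java.util.AbstractList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class MultiValuedVectorSketch {

    // Hypothetical stand-in for a ScriptDocValues<float[]>-style class:
    // a list whose entries are the field's values for one document.
    static final class VectorDocValues extends AbstractList<float[]> {
        private final List<float[]> values;

        VectorDocValues(List<float[]> values) {
            this.values = values;
        }

        // List contract: the index selects one of the field's values,
        // not one element inside a vector.
        @Override
        public float[] get(int index) {
            return values.get(index);
        }

        @Override
        public int size() {
            return values.size();
        }

        // Convenience accessor for the single-valued case (assumed here to
        // return the first value, by analogy with getValue() in scripts).
        float[] getValue() {
            if (values.isEmpty()) {
                throw new IllegalStateException("no value for this document");
            }
            return values.get(0);
        }
    }

    public static void main(String[] args) {
        // Single-valued field: element access goes through getValue().
        VectorDocValues single =
            new VectorDocValues(Collections.singletonList(new float[] {1.5f, 5.5f}));
        System.out.println(single.getValue()[1]); // prints 5.5

        // Multi-valued field: get(1) is the second vector, not the second element.
        VectorDocValues multi = new VectorDocValues(
            Arrays.asList(new float[] {1.5f, 5.5f}, new float[] {0.5f, 5.5f}));
        System.out.println(multi.get(1)[0]); // prints 0.5
    }
}
```

Under this shape, redefining get(int) to return a single float would conflict with the list-of-values contract, which is why element access goes through the value accessor first.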

There actually is a way to get a per-dimension value: call getValue() on the doc values, which returns the vector, and index into the result. For example:

curl -X PUT "localhost:9200/target_index" -H 'Content-Type: application/json' -d'
{
  "settings" : {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "target_field": {
        "type": "knn_vector",
        "dimension": 2
      }
    }
  }
}
'

curl -X PUT "localhost:9200/_bulk" -H 'Content-Type: application/json' -d'
{ "index": { "_index": "target_index" } }
{ "target_field": [1.5, 5.5]}
{ "index": { "_index": "target_index" } }
{ "target_field": [0.5, 5.5]}
'

curl localhost:9200/_refresh

curl -XGET "http://localhost:9200/target_index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 2,
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "doc[params.field].getValue()[0]",
        "params": {
          "field": "target_field"
        }
      }
    }
  }
}
'
...
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.5,
    "hits" : [
      {
        "_index" : "target_index",
        "_id" : "RGtgA5MBjntBBRdo-f-Y",
        "_score" : 1.5,
        "_source" : {
          "target_field" : [
            1.5,
            5.5
          ]
        }
      },
      {
        "_index" : "target_index",
        "_id" : "RWtgA5MBjntBBRdo-f-Y",
        "_score" : 0.5,
        "_source" : {
          "target_field" : [
            0.5,
            5.5
          ]
        }
      }
    ]
  }
}
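
To make the scores in the response easier to follow, here is a small sketch of what the script computes for the two indexed documents. It is a plain Java simulation with hypothetical names (FirstDimensionScoreSketch, score, rankedScores). It does not call OpenSearch and only mirrors the expression doc[params.field].getValue()[0].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class FirstDimensionScoreSketch {

    // Mirrors the script body: the score is the first element of the vector.
    static float score(float[] vector) {
        return vector[0];
    }

    // Scores every vector and returns the scores in descending order,
    // the same order in which the hits appear in the response above.
    static List<Float> rankedScores(List<float[]> vectors) {
        List<Float> scores = new ArrayList<>();
        for (float[] vector : vectors) {
            scores.add(score(vector));
        }
        scores.sort(Comparator.reverseOrder());
        return scores;
    }

    public static void main(String[] args) {
        // The two documents from the _bulk request.
        List<float[]> docs = Arrays.asList(
            new float[] {1.5f, 5.5f},
            new float[] {0.5f, 5.5f});
        System.out.println(rankedScores(docs)); // prints [1.5, 0.5]
    }
}
```

The printed values match the _score fields of the two hits, 1.5 and 0.5.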

opensearch-project / k-NN

[FEATURE] Element access in KNNVectorScriptDocValues ? #2233