opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

[FEATURE] add support for more than one kNN query on nested vectors with multiple inner hits and filter #1768

Open konstadin opened 3 months ago

konstadin commented 3 months ago

Is your feature request related to a problem? Yes. I want to create a document that contains more than one nested vector (nested / nested vector), query the document with multiple k-NN queries, and gather more than one inner_hit for each query when searching nested k-NN fields. This feature is available in Elasticsearch, and we are requesting parity in OpenSearch.

What solution would you like?

Expand support for k-NN search with nested fields to allow for multiple k-NN queries.

This solution builds on Enhanced multi-vector support for OpenSearch k-NN search with nested fields.

Instead of one k-NN search with nested fields on a doc, the solution supports:

The response returns:

What alternatives have you considered? Storing the documents as nested vectors (instead of nested / nested vectors) and using a boolean query with multiple k-NN queries plus an aggregation. However, the aggregation loses the mapping of which field matched which k-NN query, as well as the inner hits. It is also questionable whether the _score ranking would be calculated the same way.

Do you have any additional context?

Consider the example of storing the lines of each paragraph, for each chapter of a book. Below is an example mapping, where the lines are stored as nested embeddings in vector, and embeddings are nested in paragraph. Essentially, each document stores the paragraphs of one chapter of one book; a document is a collection of paragraphs for a chapter.

  "mappings": {
    "properties": {
      "book_id": { "type": "keyword" },
      "chapter_id": { "type": "keyword" },
      "paragraph": {
        "type": "nested",
        "properties": {
          "paragraph_id": { "type": "keyword" },
          "embeddings": {
            "type": "nested",
            "properties": {
              "line_id": { "type": "keyword" },
              "vector": {
                "type": "dense_vector",
                "index": true,
                "dims": 384,
                "similarity": "cosine"
              }
            }
          }
        }
      }
    }
  }

We want to find the chapters that have the closest matches to n lines of text, where each line of text is a k-NN search (query_1, query_2) that targets the nested embeddings in vector.

We should have the ability to filter for a specific book by book_id, in this example 1234. This filters out any unrelated books and must be applied as a pre-filter in the k-NN search, not as a post-filter.
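To make the ask concrete, here is a rough sketch of the kind of request body described above, written as a small Python helper. It is modeled on the Elasticsearch top-level knn search option that this issue cites for parity (a list of knn clauses, each with its own filter and named inner_hits); it is not an existing OpenSearch API. The helper name build_multi_knn_request, the num_candidates value, and the inner hit fields list are assumptions made for illustration.

```python
import json


def build_multi_knn_request(query_vectors, book_id, k=2, num_candidates=50):
    """Build an Elasticsearch-style search body with one knn clause per query vector.

    Hypothetical helper: the clause layout follows the Elasticsearch knn search
    option referenced in this issue, not a feature OpenSearch is known to ship.
    """
    knn_clauses = []
    for i, vector in enumerate(query_vectors, start=1):
        knn_clauses.append({
            "field": "paragraph.embeddings.vector",
            "query_vector": vector,
            "k": k,
            "num_candidates": num_candidates,  # assumed value
            # Intended as a pre-filter: restrict candidates to one book
            # before the nearest-neighbor search runs.
            "filter": {"term": {"book_id": book_id}},
            # One named inner_hits section per query, so the response
            # shows which lines matched which query.
            "inner_hits": {
                "name": f"query_{i}",
                "size": k,
                "_source": False,
                "fields": ["paragraph.embeddings.line_id"],  # assumed field list
            },
        })
    return {
        "size": 2,
        "_source": False,
        "fields": ["book_id", "chapter_id"],
        "knn": knn_clauses,
    }


# Two placeholder 384-dimensional query vectors (dims matches the mapping).
body = build_multi_knn_request([[0.1] * 384, [0.2] * 384], book_id="1234")
print([clause["inner_hits"]["name"] for clause in body["knn"]])
# json.dumps(body) yields the JSON payload for the search request.
```

Each clause carries its own filter and inner_hits name, which is what lets the response keep the query-to-line mapping that the aggregation-based alternative loses.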

The sample response below returns the top 2 documents, with k=2 for each k-NN search.

"hits": {
  "max_score": 1.7332492,
    "hits": [
      {
        "_score": 1.7332492,
        "fields": {
          "book_id": [ "1234" ],
          "chapter_id": [ "chapter10" ]
        },
        "inner_hits": {
          "query_1": {
            "hits": {
              "max_score": 0.83575505,
                "hits": [
                  {
                    "_score": 0.83575505,
                    "fields": {
                      "paragraph.embeddings": [{"paragraph_id": [ "p_1" ], "line_id": [ "line_3" ]}]
                    }
                  },
                  {
                    "_score": 0.0333445,
                    "fields": {
                      "paragraph.embeddings": [{"paragraph_id": [ "p_1" ], "line_id": [ "line_6" ]}]
                    }
                  }
                ]
            }
          },
          "query_2": {
            "hits": {
              "max_score": 0.8974941,
                "hits": [
                  {
                    "_score": 0.8974941,
                    "fields": {
                      "paragraph.embeddings": [{"paragraph_id": [ "p_3" ], "line_id": [ "line_5" ]}]
                    }
                  },
                  {
                    "_score": 0.55534545,
                    "fields": {
                      "paragraph.embeddings": [{"paragraph_id": [ "p_3" ], "line_id": [ "line_8" ]}]
                    }
                  }
                ]
            }
          }
        }
      },
      {
        "_score": 0.8735112,
        "fields": {
          "book_id": [ "1234" ],
          "chapter_id": [ "chapter3" ]
        },
        "inner_hits": {
          "query_1": {
            "hits": {
              "max_score": null,
                "hits": []
            }
          },
          "query_2": {
            "hits": {
              "max_score": 0.8735112,
                "hits": [
                  {
                    "_score": 0.8735112,
                    "fields": {
                      "paragraph.embeddings": [{"paragraph_id": [ "p_7" ], "line_id": [ "line_56" ]}]
                    }
                  },
                  {
                    "_score": 0.03553,
                    "fields": {
                      "paragraph.embeddings": [{"paragraph_id": [ "p_7" ], "line_id": [ "line_88" ]}]
                    }
                  }
                ]
            }
          }
        }
      }
  ]
}
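As an illustration of why the per-query inner_hits matter, the following sketch flattens a response of the shape shown above into (chapter, query, paragraph, line, score) rows. The function name matched_lines is hypothetical, and the sample dictionary is a trimmed copy of the second hit from the response above, not real cluster output.

```python
def matched_lines(response):
    """Flatten per-query inner hits into (chapter, query, paragraph, line, score) rows."""
    rows = []
    for hit in response["hits"]["hits"]:
        chapter = hit["fields"]["chapter_id"][0]
        # Each key under inner_hits is the name of one k-NN query.
        for query_name, result in hit.get("inner_hits", {}).items():
            for inner in result["hits"]["hits"]:
                for emb in inner["fields"]["paragraph.embeddings"]:
                    rows.append((
                        chapter,
                        query_name,
                        emb["paragraph_id"][0],
                        emb["line_id"][0],
                        inner["_score"],
                    ))
    return rows


# Trimmed sample: chapter3 matched nothing for query_1 and one line for query_2.
sample = {
    "hits": {
        "hits": [
            {
                "_score": 0.8735112,
                "fields": {"book_id": ["1234"], "chapter_id": ["chapter3"]},
                "inner_hits": {
                    "query_1": {"hits": {"max_score": None, "hits": []}},
                    "query_2": {
                        "hits": {
                            "max_score": 0.8735112,
                            "hits": [
                                {
                                    "_score": 0.8735112,
                                    "fields": {
                                        "paragraph.embeddings": [
                                            {"paragraph_id": ["p_7"], "line_id": ["line_56"]}
                                        ]
                                    },
                                }
                            ],
                        }
                    },
                },
            }
        ]
    }
}

rows = matched_lines(sample)
print(rows)
```

With a single aggregated score per chapter, this query-to-line attribution would not be recoverable.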

konstadin commented 3 months ago

@heemin32 would it be possible to triage this enhancement and identify a timeline for delivery? We are currently blocked without this work and need to understand if and when this feature will be available.

heemin32 commented 3 months ago

Hi @konstadin. Is this functionality provided for text fields? I don't think there is a way to run two queries and get an inner hits result for each query, even for text fields. Could you also check whether hybrid search could be used for your use case? https://opensearch.org/docs/latest/search-plugins/hybrid-search/

konstadin commented 3 months ago

Hi @heemin32. I am not aware of this functionality being provided for text fields. However, it is available for multiple k-NN searches.

Search multiple knn fields: Available in ES 8.12 -> https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#_search_multiple_knn_fields What are the plans for parity in OS v2.x?

Nested kNN Search with 1 Inner hits: Available in ES 8.12 -> https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#nested-knn-search-inner-hits

Nested kNN Search with multiple Inner hits: Available in ES 8.13 -> https://github.com/elastic/elasticsearch/pull/104006 What are the plans for parity in OS v2.x?

Filtered kNN search, applied as a pre-filter: Available in ES 8.12 -> https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#knn-search-filter-example Will there be parity in OS 2.15? -> https://github.com/opensearch-project/OpenSearch/pull/13903

The request is to provide parity for the above: search multiple k-NN fields on nested embeddings and return more than one inner hit, with a filter.

navneet1v commented 3 months ago

@konstadin you can use a bool query with a should/must clause to search on multiple k-NN fields (nested or non-nested, it doesn't matter). A k-NN query clause in OpenSearch is just like any other OpenSearch query clause; it doesn't need the special treatment that Elastic has given it. So you can search on multiple k-NN fields the same way you would search on multiple text fields.

POST <index-name>/_search
{
  "size": 10,
  "query": {
    "bool": {
      "should": [
        {
          "knn": {
            "my_vector2": {
              "vector": [
                2,
                3,
                5,
                6
              ],
              "k": 10
            }
          }
        },
        {
          "knn": {
            "my_vector1": {
              "vector": [
                2,
                3,
                5,
                6
              ],
              "k": 6
            }
          }
        }
      ]
    }
  }
}

Same goes for a nested field.
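For the nested case, a sketch of what that might look like is below, built as a Python dictionary. It assumes a simplified one-level mapping with a nested field embeddings containing a knn_vector field embeddings.vector (the mapping earlier in this issue has two nested levels), and it assumes the filter inside the knn clause is evaluated against the parent document's book_id. The helper names are hypothetical. This shows how to combine the clauses; it does not by itself provide more than one inner hit per query, which is what this issue requests.

```python
import json


def nested_knn_clause(name, vector, k, book_id):
    """One nested k-NN clause with a filter and a uniquely named inner_hits section."""
    return {
        "nested": {
            "path": "embeddings",  # assumed single-level nested path
            "query": {
                "knn": {
                    "embeddings.vector": {
                        "vector": vector,
                        "k": k,
                        # Filter placed inside the knn clause; assumed here
                        # to apply to the parent document's book_id.
                        "filter": {"term": {"book_id": book_id}},
                    }
                }
            },
            # inner_hits names must be unique within one search request.
            "inner_hits": {"name": name},
        }
    }


def build_bool_request(query_vectors, book_id, k=2, size=2):
    """Combine one nested k-NN clause per query vector under bool/should."""
    should = [
        nested_knn_clause(f"query_{i}", vector, k, book_id)
        for i, vector in enumerate(query_vectors, start=1)
    ]
    return {"size": size, "query": {"bool": {"should": should}}}


# Reuses the 4-dimensional placeholder vector from the example above.
request_body = build_bool_request([[2, 3, 5, 6], [2, 3, 5, 6]], book_id="1234")
print(json.dumps(request_body, indent=2))
```

Using should means a chapter can match on either query, and its _score is the sum of the clauses that matched.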

konstadin commented 3 months ago

Thanks @navneet1v @heemin32 will take a look.

What are the plans to provide feature to return multiple Inner hits? Available in ES 8.13 -> https://github.com/elastic/elasticsearch/pull/104006

navneet1v commented 3 months ago

> Thanks @navneet1v @heemin32 will take a look.
>
> What are the plans to provide feature to return multiple Inner hits? Available in ES 8.13 -> elastic/elasticsearch#104006

@heemin32 was this feature added in the 2.15 release of OpenSearch?

heemin32 commented 3 months ago

It is not. This issue is somewhat related to https://github.com/opensearch-project/k-NN/issues/1743