[FEATURE] Improved multi vector support using Nested fields

vamshin commented 1 year ago

Is your feature request related to a problem? Related to https://github.com/opensearch-project/k-NN/issues/675

What solution would you like? Use the Parent Join support added in Lucene (https://github.com/apache/lucene/pull/12434) to retrieve all the matching parent documents for a given query, instead of searching over child documents, which results in fewer hits.

vamshin commented 1 year ago

Question:

1) Does this feature need any code changes in the k-NN plugin to get the "Parent Join" feature, or is it automatically available when we move to the latest Lucene version?

2) Can this be leveraged for all the engines (Lucene, faiss, nmslib)? From my understanding it's agnostic to the engine, but let's confirm.

heemin32 commented 1 year ago

Question:

Does this feature need any code changes in the k-NN plugin to get the "Parent Join" feature, or is it automatically available when we move to the latest Lucene version?

It needs a code change in the k-NN plugin to adopt the feature. Lucene introduced two new query types, ToParentBlockJoinByteKnnVectorQuery and ToParentBlockJoinFloatKnnVectorQuery. For a join query of type has_child, we need to return those queries instead of the KnnByteVectorQuery and KnnFloatVectorQuery that we are using now.

ToParentBlockJoin[Byte|Float]KnnVectorQuery requires one additional argument that we need to pass, BitSetProducer parentsFilter. How to obtain that value in the k-NN plugin needs more investigation.
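
For context on what parentsFilter is for, here is a minimal Python sketch, not Lucene or k-NN plugin code. It assumes the block-join layout in which the nested (child) docs of a document are stored immediately before their parent doc in a segment, so a bit set that marks parent doc IDs is enough to map any child hit to its parent. The helper names and the sample layout are hypothetical.

```python
from typing import List, Optional


def next_set_bit(parent_bits: List[bool], start: int) -> Optional[int]:
    """Return the first doc ID >= start whose parent bit is set, or None."""
    for doc_id in range(start, len(parent_bits)):
        if parent_bits[doc_id]:
            return doc_id
    return None


# Hypothetical segment: each block is three children followed by their parent.
# doc IDs: 0, 1, 2 -> children of doc 3; 4, 5, 6 -> children of doc 7.
DOC_KINDS = ["child", "child", "child", "parent",
             "child", "child", "child", "parent"]

# Stand-in for what a parents filter yields: one bit per doc, set for parents.
PARENT_BITS = [kind == "parent" for kind in DOC_KINDS]


def parent_of(child_doc_id: int) -> Optional[int]:
    """Map a child hit to its parent: the next parent bit at or after it."""
    return next_set_bit(PARENT_BITS, child_doc_id)
```

With this mapping, a vector search can keep only the best child per parent while it collects hits, which is what lets it return k distinct parents.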

Can this be leveraged for all the engines (Lucene, faiss, nmslib)? From my understanding it's agnostic to the engine, but let's confirm.

The feature works only for the Lucene engine, as Faiss and nmslib use our own custom query.

heemin32 commented 1 year ago

Expected behavior

1. Create knn field with lucene engine

PUT /multi-vector
{
    "settings": {
        "index": {
            "knn": true,
            "knn.algo_param.ef_search": 100
        }
    },
    "mappings": {
        "properties": {
            "nested_field": {
                "type": "nested",
                "properties": {
                    "my_vector1": {
                        "type": "knn_vector",
                        "dimension": 3,
                        "method": {
                            "name": "hnsw",
                            "space_type": "l2",
                            "engine": "lucene",
                            "parameters": {
                                "ef_construction": 128,
                                "m": 24
                            }
                        }
                    }
                }
            }
        }
    }
}

2. Index data

PUT /_bulk?refresh=true
{ "index": { "_index": "multi-vector", "_id": "1" } }
{"nested_field":[{"my_vector1":[1,1,1]},{"my_vector1":[2,2,2]},{"my_vector1":[3,3,3]}]}
{ "index": { "_index": "multi-vector", "_id": "2" } }
{"nested_field":[{"my_vector1":[10,10,10]},{"my_vector1":[20,20,20]},{"my_vector1":[30,30,30]}]}

3. Query data

GET /multi-vector/_search
{
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.my_vector1": {
            "vector": [1,1,1],
            "k": 2
          }
        }
      }
    }
  }
}

4. Should return two documents (the current implementation returns one document)

{
    "took": 23,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "multi-vector",
                "_id": "1",
                "_score": 1.0,
                "_source": {
                    "nested_field": [
                        {
                            "my_vector1": [
                                1,
                                1,
                                1
                            ]
                        },
                        {
                            "my_vector1": [
                                2,
                                2,
                                2
                            ]
                        },
                        {
                            "my_vector1": [
                                3,
                                3,
                                3
                            ]
                        }
                    ]
                }
            },
            {
                "_index": "multi-vector",
                "_id": "2",
                "_score": 0.0040983604,
                "_source": {
                    "nested_field": [
                        {
                            "my_vector1": [
                                10,
                                10,
                                10
                            ]
                        },
                        {
                            "my_vector1": [
                                20,
                                20,
                                20
                            ]
                        },
                        {
                            "my_vector1": [
                                30,
                                30,
                                30
                            ]
                        }
                    ]
                }
            }
        ]
    }
}
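
The difference between the current and the expected behavior can be reproduced with a small Python simulation. This is only a sketch under stated assumptions: it uses brute-force search instead of HNSW, and it assumes the l2 score is 1 / (1 + squared distance), which matches the scores in the response above. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]

# The two documents from step 2, keyed by _id.
DOCS: Dict[str, List[Vector]] = {
    "1": [[1, 1, 1], [2, 2, 2], [3, 3, 3]],
    "2": [[10, 10, 10], [20, 20, 20], [30, 30, 30]],
}


def l2_score(query: Vector, vector: Vector) -> float:
    """Assumed scoring for space_type l2: 1 / (1 + squared L2 distance)."""
    squared = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + squared)


def scored_children(query: Vector) -> List[Tuple[float, str]]:
    """Score every nested vector and sort best-first as (score, parent _id)."""
    scored = [(l2_score(query, v), pid) for pid, vs in DOCS.items() for v in vs]
    return sorted(scored, key=lambda item: item[0], reverse=True)


def child_level_top_k(query: Vector, k: int) -> Dict[str, float]:
    """Current behavior: take the k nearest children, then collapse to parents."""
    best: Dict[str, float] = {}
    for score, pid in scored_children(query)[:k]:
        best[pid] = max(score, best.get(pid, 0.0))
    return best


def parent_level_top_k(query: Vector, k: int) -> Dict[str, float]:
    """Expected behavior: keep the best child per parent until k parents are found."""
    best: Dict[str, float] = {}
    for score, pid in scored_children(query):
        if pid in best:
            continue
        if len(best) == k:
            break
        best[pid] = score
    return best
```

For the query [1,1,1] with k=2, child_level_top_k returns only document 1, because both nearest vectors belong to it, while parent_level_top_k returns documents 1 and 2.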

dylan-tong-aws commented 9 months ago

@heemin32 @vamshin, if someone intentionally modeled multiple documents as vectors in a nested field, this change would break their application, correct?

Or, is there a configuration to modify the behavior?

heemin32 commented 9 months ago

It won't. After the change, a query might return k results where it returned fewer than k results before. If it returned more than k results before, the results will be the same after this change.

dylan-tong-aws commented 9 months ago

Let's say I index two docs with a nested field. The first doc has vectors [A,B] and the second doc has vectors [C,D]. A, B, C, and D represent different "items" in my corpus.

I perform a search with k=2 and the most similar items are A and B. I expect the first doc to be returned. This was the previous behavior, correct?

Now suppose (A,B) and (C,D) each represent an "item", and A, B, C, and D are chunks of those items. The new enhancement will retrieve both docs for k=2, correct?

dylan-tong-aws commented 9 months ago

Also, can you please provide an example of how to use this feature with neural search? Specifically, given a nested field of strings, how can I construct a nested field of vectors using the text embedding processor?

heemin32 commented 9 months ago

That is correct.

Let's say I index two docs with a nested field. The first doc has vectors [A,B] and the second doc has vectors [C,D]. A, B, C, and D represent different "items" in my corpus.

I perform a search with k=2 and the most similar items are A and B. I expect the first doc to be returned. This was the previous behavior, correct?

Now suppose (A,B) and (C,D) each represent an "item", and A, B, C, and D are chunks of those items. The new enhancement will retrieve both docs for k=2, correct?

This is correct.

dylan-tong-aws commented 9 months ago

Ok, so let's say I intentionally modeled my application's data around the first scenario. If I upgrade to 2.12, would it not break my app? Before 2.12 I get one doc. After upgrading, I get two docs in the result. My app functionality has changed.

heemin32 commented 9 months ago

I wouldn't call it breaking an app. It is an incorrect way of using the nested field. If you rely only on the k value, the behavior is non-deterministic. For example, in the scenario above, if there are two segments, with doc1 in segment1 and doc2 in segment2, then with k=2 you will get both documents as results even before this change.
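
The segment example can be sketched in Python. This is a simulation, not plugin code, and it assumes, as the comment above describes, that k is applied independently within each segment and the per-segment hits are then combined. The vectors and function names are hypothetical.

```python
from typing import Dict, List, Set

Vector = List[float]
Segment = Dict[str, List[Vector]]  # parent _id -> nested vectors


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_parents_in_segment(segment: Segment, query: Vector, k: int) -> Set[str]:
    """Pre-change behavior: k nearest children in one segment, mapped to parents."""
    children = [(squared_l2(query, v), pid)
                for pid, vs in segment.items() for v in vs]
    children.sort(key=lambda item: item[0])
    return {pid for _, pid in children[:k]}


def search(segments: List[Segment], query: Vector, k: int) -> Set[str]:
    """Combine the parents found in every segment (assumed per-segment k)."""
    hits: Set[str] = set()
    for segment in segments:
        hits |= top_k_parents_in_segment(segment, query, k)
    return hits


DOC1: Segment = {"doc1": [[1, 1, 1], [2, 2, 2]]}        # vectors A, B
DOC2: Segment = {"doc2": [[10, 10, 10], [20, 20, 20]]}  # vectors C, D

ONE_SEGMENT = [{**DOC1, **DOC2}]  # both docs in the same segment
TWO_SEGMENTS = [DOC1, DOC2]       # doc1 in segment1, doc2 in segment2
```

With the query [1,1,1] and k=2, search(ONE_SEGMENT, ...) yields only doc1, while search(TWO_SEGMENTS, ...) yields both docs, so the same data can produce different results depending on how it is spread across segments.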

heemin32 commented 9 months ago

Also, can you please provide an example of how to use this feature with neural search? Specifically, given a nested field of strings, how can I construct a nested field of vectors using the text embedding processor?

That question should be asked in the neural search repo. There is a GitHub issue for it: https://github.com/opensearch-project/neural-search/issues/482

dylan-tong-aws commented 9 months ago

One could debate what's a good data model, but there could be valid reasons for choosing this data modeling design. Regardless of whether the user made a good data modeling decision, we don't restrict users from designing their data model either way.

I suggest we have an index configuration like "nested_vector_mode" = SINGLE | MULTI. It could default to "SINGLE". At least someone has the option to change the config to "MULTI" in case this causes a breaking change.

heemin32 commented 9 months ago

The k parameter does not mean the size of the result. You need to pass the size parameter to limit the number of final results of your query. In short, we are increasing recall for nested field search. For example, let's say you requested the 10 nearest vectors and we returned only 2 results even though 5 vectors were available. If we improve this and return 5 vectors now, would that be regarded as a breaking change?
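
To make the distinction concrete, here is a small Python sketch that builds a search body with both parameters. The helper name is hypothetical; the field names come from the example earlier in this thread, and size is the standard top-level search parameter that caps the number of returned hits.

```python
import json
from typing import Any, Dict, List


def build_nested_knn_search(path: str, field: str, vector: List[float],
                            k: int, size: int) -> Dict[str, Any]:
    """Build a nested k-NN search body: k is the number of neighbors to find, size caps the hits returned."""
    return {
        "size": size,
        "query": {
            "nested": {
                "path": path,
                "query": {
                    "knn": {
                        f"{path}.{field}": {"vector": vector, "k": k}
                    }
                },
            }
        },
    }


# Ask for 2 nearest neighbors but return at most 1 hit.
body = build_nested_knn_search("nested_field", "my_vector1", [1, 1, 1], k=2, size=1)
print(json.dumps(body, indent=2))
```

An application that needs exactly one hit can pin size to 1 instead of depending on how many documents a given k happens to produce.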

dylan-tong-aws commented 9 months ago

Can you think of a scenario where the ranking is changed? Before we only return 2 results because the 5 most similar things are in those two documents. Now we return 5. Can you think of a scenario where the 2 original results might end up being ranked lower after the change?

heemin32 commented 9 months ago

Can you think of a scenario where the ranking is changed? Before we only return 2 results because the 5 most similar things are in those two documents. Now we return 5. Can you think of a scenario where the 2 original results might end up being ranked lower after the change?

No such case, unless the user reranks the returned results.

asfoorial commented 7 months ago

Hi all,

Two questions here,

1. Does rerank work on nested text fields?
2. If parent join is used internally for this feature, would it affect indices that already use parent join?

Thanks

heemin32 commented 7 months ago

1. Nested text fields are not supported in the rerank processor.
2. Yes. An existing index with a nested field will benefit from this implementation without reindexing.

asfoorial commented 7 months ago

I have an index that already uses parent join. Would that conflict with this feature? As far as I know, an index can have only one parent join field.

heemin32 commented 7 months ago

It won't conflict with this feature. This feature does not use parent join internally.
