Closed vamshin closed 8 months ago
Question:
1) Does this feature need any code changes from the k-NN plugin to get the "Parent Join" feature, or is it automatically available when we move to the latest Lucene version?
2) Can this be leveraged for all the engines (Lucene, faiss, nmslib)? From my understanding it's agnostic to the engine, but let's confirm.
Question:
- Does this feature need any code changes from k-NN plugin to get the "Parent Join" feature or is it automatically available when we move to latest Lucene version?
It needs a code change in the k-NN plugin to adopt the feature. Lucene introduced two new query types, ToParentBlockJoinByteKnnVectorQuery and ToParentBlockJoinFloatKnnVectorQuery. For a join query of type has_child, we need to return those queries instead of the KnnByteVectorQuery and KnnFloatVectorQuery that we are using now.
ToParentBlockJoin[Byte|Float]KnnVectorQuery requires one additional argument that we need to pass, BitSetProducer parentsFilter. More investigation is needed on how to obtain that value in the k-NN plugin.
- Can this be leveraged for all the engines (Lucene, faiss, nmslib)? From my understanding it's agnostic to the engine, but let's confirm.
The feature works only for the Lucene engine, as Faiss and nmslib use our own custom Query.
PUT /multi-vector
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "nested_field": {
        "type": "nested",
        "properties": {
          "my_vector1": {
            "type": "knn_vector",
            "dimension": 3,
            "method": {
              "name": "hnsw",
              "space_type": "l2",
              "engine": "lucene",
              "parameters": {
                "ef_construction": 128,
                "m": 24
              }
            }
          }
        }
      }
    }
  }
}
PUT /_bulk?refresh=true
{ "index": { "_index": "multi-vector", "_id": "1" } }
{"nested_field":[{"my_vector1":[1,1,1]},{"my_vector1":[2,2,2]},{"my_vector1":[3,3,3]}]}
{ "index": { "_index": "multi-vector", "_id": "2" } }
{"nested_field":[{"my_vector1":[10,10,10]},{"my_vector1":[20,20,20]},{"my_vector1":[30,30,30]}]}
GET /multi-vector/_search
{
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.my_vector1": {
            "vector": [1,1,1],
            "k": 2
          }
        }
      }
    }
  }
}
{
  "took": 23,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "multi-vector",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "nested_field": [
            { "my_vector1": [1, 1, 1] },
            { "my_vector1": [2, 2, 2] },
            { "my_vector1": [3, 3, 3] }
          ]
        }
      },
      {
        "_index": "multi-vector",
        "_id": "2",
        "_score": 0.0040983604,
        "_source": {
          "nested_field": [
            { "my_vector1": [10, 10, 10] },
            { "my_vector1": [20, 20, 20] },
            { "my_vector1": [30, 30, 30] }
          ]
        }
      }
    ]
  }
}
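As a sanity check on the scores in the response above, here is a minimal Python sketch. It assumes the l2 space score is computed as 1 / (1 + squared distance) and that each parent document is scored by its nearest nested vector; both are assumptions made for illustration, and the helper name `l2_score` is hypothetical rather than part of the plugin.

```python
from typing import List, Sequence


def l2_score(query: Sequence[float], vector: Sequence[float]) -> float:
    """Assumed l2 scoring: 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + squared_distance)


query: List[float] = [1, 1, 1]
doc1 = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
doc2 = [[10, 10, 10], [20, 20, 20], [30, 30, 30]]

# Assumption: a parent takes the score of its best-matching nested vector.
best1 = max(l2_score(query, v) for v in doc1)
best2 = max(l2_score(query, v) for v in doc2)

print(f"doc 1: {best1:.6f}")  # 1.000000
print(f"doc 2: {best2:.6f}")  # 0.004098
```

Under these assumptions the values agree with the `_score` fields reported for documents 1 and 2, up to floating-point precision.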
@heemin32 @vamshin, if someone intentionally modeled multiple documents as vectors in a nested field, this change would break their application, correct?
Or, is there a configuration to modify the behavior?
It won't. After the change, a query might return k results where it previously returned fewer than k. If it returned more than k results before, the results will be the same after this change.
Let's say I index two docs with a nested field. The first doc has vectors [A,B] and the second doc has vectors [C,D]. A, B, C, and D represent different "items" in my corpus.
I perform a search with k=2 and the most similar items are A and B. I expect the first doc to be returned. This was the previous behavior, correct?
Now, if (A,B) and (C,D) represent "items" and A, B, C, and D are chunks that represent these items, the new enhancement will retrieve both docs for k=2, correct?
Also, can you please provide an example of how to use this feature with neural search? Specifically, given a nested field of strings, how can I construct a nested field of vectors using the text embedding processor?
That is correct on both counts: previously only the first doc was returned, and with the new enhancement both docs are retrieved for k=2.
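The before/after behavior confirmed here can be sketched with a small Python model. This is a simplified illustration and not the plugin's implementation: `child_level_search` and `parent_level_search` are hypothetical names, exact brute-force distances stand in for HNSW, and the data reuses the two documents from the example earlier in the thread.

```python
from typing import Dict, List, Sequence

Docs = Dict[str, List[Sequence[float]]]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def child_level_search(docs: Docs, query: Sequence[float], k: int) -> List[str]:
    """Simplified old behavior: take the k nearest child vectors, then collapse to parents."""
    children = sorted(
        (squared_l2(vec, query), doc_id)
        for doc_id, vecs in docs.items()
        for vec in vecs
    )
    parents: List[str] = []
    for _, doc_id in children[:k]:
        if doc_id not in parents:
            parents.append(doc_id)
    return parents


def parent_level_search(docs: Docs, query: Sequence[float], k: int) -> List[str]:
    """Simplified new behavior: rank parents by their nearest child and return k parents."""
    ranked = sorted(
        (min(squared_l2(vec, query) for vec in vecs), doc_id)
        for doc_id, vecs in docs.items()
    )
    return [doc_id for _, doc_id in ranked[:k]]


docs: Docs = {
    "1": [[1, 1, 1], [2, 2, 2], [3, 3, 3]],
    "2": [[10, 10, 10], [20, 20, 20], [30, 30, 30]],
}
query = [1, 1, 1]

print(child_level_search(docs, query, k=2))   # ['1']
print(parent_level_search(docs, query, k=2))  # ['1', '2']
```

In this model the child-level search spends both of its k slots on vectors from document 1, while the parent-level search returns k distinct documents.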
Ok, so let's say I intentionally modeled my application's data around the first scenario. If I upgrade to 2.12, would it not break my app? Before 2.12 I get one doc; after upgrading I get two docs in the result. My app's functionality has changed.
I wouldn't call that breaking an app. It is an incorrect way of using the nested field. If you rely only on the k value, the behavior is non-deterministic. In the example above, if there are two segments with doc1 in segment1 and doc2 in segment2, then with k=2 you would get both documents as results even before this change.
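The segment dependence described in this reply can be illustrated with a toy Python model. It assumes, purely for illustration, that each segment contributes its own top-k child vectors and that the per-segment results are merged without a further cut to k; the function names and the 2-D vectors standing in for A, B, C, and D are hypothetical.

```python
from typing import Dict, List, Sequence

Segment = Dict[str, List[Sequence[float]]]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def segment_top_k(segment: Segment, query: Sequence[float], k: int) -> Dict[str, float]:
    """Take the k nearest child vectors in one segment and collapse them to parents."""
    children = sorted(
        (squared_l2(vec, query), doc_id)
        for doc_id, vecs in segment.items()
        for vec in vecs
    )
    parents: Dict[str, float] = {}
    for dist, doc_id in children[:k]:
        parents.setdefault(doc_id, dist)  # first hit per parent is its nearest child
    return parents


def search(segments: List[Segment], query: Sequence[float], k: int) -> List[str]:
    """Assumed merge: union of per-segment results, ordered by distance."""
    merged: Dict[str, float] = {}
    for segment in segments:
        for doc_id, dist in segment_top_k(segment, query, k).items():
            merged[doc_id] = min(dist, merged.get(doc_id, float("inf")))
    return sorted(merged, key=merged.get)


A, B, C, D = [0, 0], [1, 0], [5, 5], [6, 5]  # hypothetical vectors
query = [0, 0]

one_segment = [{"doc1": [A, B], "doc2": [C, D]}]
two_segments = [{"doc1": [A, B]}, {"doc2": [C, D]}]

print(search(one_segment, query, k=2))   # ['doc1']
print(search(two_segments, query, k=2))  # ['doc1', 'doc2']
```

Under this model the same data and the same k yield one or two documents depending only on how the documents are spread across segments.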
Regarding the request for an example of using this feature with neural search (constructing a nested field of vectors from a nested field of strings with the text embedding processor): that question should be asked in the neural-search repo. There is a GitHub issue for it: https://github.com/opensearch-project/neural-search/issues/482
One could debate what's a good data model, but there could be valid reasons for electing this data modeling design. Regardless of whether the user made a good data modeling decision, we don't govern or restrict users from being able to design their data model in either way.
I suggest we have an index configuration like "nested_vector_mode" = SINGLE | MULTI. It could be defaulted to "SINGLE". At least someone has the option to change the config to "MULTI" in case this causes a breaking change.
The k parameter does not determine the size of the result. You need to pass the size parameter to limit the number of final results of your query.
In short, we are increasing recall for nested field search. For example, let's say you requested the 10 nearest vectors and we returned only 2 results even though 5 vectors were available. If we enhance this and now return 5 vectors, would that be regarded as a breaking change?
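To make the distinction between k and size concrete, here is a small Python sketch that builds a search body carrying both parameters. The helper `nested_knn_search_body` is a hypothetical name introduced for illustration; the field names reuse the multi-vector example from earlier in the thread, and the k and size values are arbitrary.

```python
import json
from typing import Any, Dict, Sequence


def nested_knn_search_body(
    path: str, field: str, vector: Sequence[float], k: int, size: int
) -> Dict[str, Any]:
    """Build a nested k-NN search body: k goes to the knn clause, size caps the final hits."""
    return {
        "size": size,  # number of documents returned to the caller
        "query": {
            "nested": {
                "path": path,
                "query": {
                    "knn": {
                        f"{path}.{field}": {
                            "vector": list(vector),
                            "k": k,  # neighbors requested from the vector search
                        }
                    }
                },
            }
        },
    }


body = nested_knn_search_body("nested_field", "my_vector1", [1, 1, 1], k=10, size=2)
print(json.dumps(body, indent=2))
```

An application that needs a fixed number of hits would set size explicitly instead of relying on k to bound the result.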
Can you think of a scenario where the ranking is changed? Before we only return 2 results because the 5 most similar things are in those two documents. Now we return 5. Can you think of a scenario where the 2 original results might end up being ranked lower after the change?
No such case exists, unless the user reranks the returned results.
Hi all,
Two questions here,
- Does rerank work on nested text fields?
- If parent join is used internally for this feature, would it affect indices that already use parent join?
Thanks
- Nested text fields are not supported in the rerank processor.
- Yes. An existing index with a nested field will benefit from this implementation without reindexing.
I have an index that already uses parent join. Would that conflict with this feature? As far as I know, an index can have only one parent join field.
It won't conflict with this feature. This feature does not use parent join internally.
Is your feature request related to a problem? Related to https://github.com/opensearch-project/k-NN/issues/675
What solution would you like? Use the Parent Join feature support to retrieve all the documents for a given query, instead of searching over child documents, which results in fewer hits: https://github.com/apache/lucene/pull/12434.