opensearch-project / k-NN


[FEATURE] inner_hits in nested neural query should return all the chunks #2113

Open yuye-aws opened 2 weeks ago

yuye-aws commented 2 weeks ago

What is the bug?

I am using the text_chunking and text_embedding processors to ingest documents into an index. The text_chunking search example works well, but inner_hits returns only a single element from the list of chunks. This happens regardless of whether score_mode is set to max or avg.

How can one reproduce the bug?

  1. Register a text embedding model.
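(A minimal sketch of this step, for reference: register the model through the ML Commons API, poll the returned task for the model_id, then deploy it. The model name and version below are assumptions, chosen because this pretrained model produces 768-dimensional embeddings, matching the index mapping in step 3.)

POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}

GET /_plugins/_ml/tasks/<task_id>

POST /_plugins/_ml/models/<model_id>/_deploy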
  2. Create a text chunking and embedding pipeline:
PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline
{
  "description": "A text chunking and embedding ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "6ipW4JEBXVV1cW1lcFvy",
        "field_map": {
          "passage_chunk": "passage_chunk_embedding"
        }
      }
    }
  ]
}
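(Optional: before ingesting, the pipeline's chunking behavior can be sanity-checked with the simulate API. A sketch, assuming the model from step 1 is deployed:)

POST _ingest/pipeline/text-chunking-embedding-ingest-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
      }
    }
  ]
}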
  3. Create an index with the following mapping:
PUT testindex
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "passage_text": {
        "type": "text"
      },
      "passage_chunk_embedding": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "knn_vector",
            "dimension": 768,
            "method": {
              "name": "hnsw",
              "engine": "lucene"
            }
          }
        }
      }
    }
  }
}
  4. Ingest some sample documents into the index (run the following command two times).
POST testindex/_doc?pipeline=text-chunking-embedding-ingest-pipeline
{
  "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
}
  5. Search the index with a nested neural query:
GET testindex/_search
{
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "passage_chunk_embedding",
      "query": {
        "neural": {
          "passage_chunk_embedding.knn": {
            "query_text": "document",
            "model_id": "6ipW4JEBXVV1cW1lcFvy"
          }
        }
      },
      "inner_hits": {}
    }
  }
}
  6. Receive the search result:
{
    "took": 1361,
    "timed_out": false,
    "_shards": {
      "total": 1,
      "successful": 1,
      "skipped": 0,
      "failed": 0
    },
    "hits": {
      "total": {
        "value": 2,
        "relation": "eq"
      },
      "max_score": 0.02276505,
      "hits": [
        {
          "_index": "testindex",
          "_id": "7SqB4JEBXVV1cW1lKVvd",
          "_score": 0.02276505,
          "_source": {
            "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch.",
            "passage_chunk": [
              "This is an example document to be chunked. The document ",
              "The document contains a single paragraph, two sentences and 24 ",
              "and 24 tokens by standard tokenizer in OpenSearch."
            ],
            "passage_chunk_embedding": [
              {
                "knn": [ ... ]
              },
              {
                "knn": [ ... ]
              },
              {
                "knn": [ ... ]
              }
            ]
          },
          "inner_hits": {
            "passage_chunk_embedding": {
              "hits": {
                "total": {
                  "value": 1,
                  "relation": "eq"
                },
                "max_score": 0.02276505,
                "hits": [
                  {
                    "_index": "testindex",
                    "_id": "7SqB4JEBXVV1cW1lKVvd",
                    "_nested": {
                      "field": "passage_chunk_embedding",
                      "offset": 1
                    },
                    "_score": 0.02276505,
                    "_source": {
                      "knn": [ ... ]
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "_index": "testindex",
          "_id": "7iqB4JEBXVV1cW1l5lv_",
          "_score": 0.02276505,
          "_source": {
            "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch.",
            "passage_chunk": [
              "This is an example document to be chunked. The document ",
              "The document contains a single paragraph, two sentences and 24 ",
              "and 24 tokens by standard tokenizer in OpenSearch."
            ],
            "passage_chunk_embedding": [
              {
                "knn": [ ... ]
              },
              {
                "knn": [ ... ]
              },
              {
                "knn": [ ... ]
              }
            ]
          },
          "inner_hits": {
            "passage_chunk_embedding": {
              "hits": {
                "total": {
                  "value": 1,
                  "relation": "eq"
                },
                "max_score": 0.02276505,
                "hits": [
                  {
                    "_index": "testindex",
                    "_id": "7iqB4JEBXVV1cW1l5lv_",
                    "_nested": {
                      "field": "passage_chunk_embedding",
                      "offset": 1
                    },
                    "_score": 0.02276505,
                    "_source": {
                      "knn": [ ... ]
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }

What is the expected behavior?

The inner_hits section should return the matching score and offset for every retrieved chunk, not just the top-scoring one.
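For the sample document above, that would mean three inner hits, one per chunk, along the lines of this sketch (only the inner_hits fragment is shown; the scores marked ... are placeholders):

"inner_hits": {
  "passage_chunk_embedding": {
    "hits": {
      "total": { "value": 3, "relation": "eq" },
      "max_score": 0.02276505,
      "hits": [
        { "_nested": { "field": "passage_chunk_embedding", "offset": 0 }, "_score": ..., "_source": { "knn": [ ... ] } },
        { "_nested": { "field": "passage_chunk_embedding", "offset": 1 }, "_score": 0.02276505, "_source": { "knn": [ ... ] } },
        { "_nested": { "field": "passage_chunk_embedding", "offset": 2 }, "_score": ..., "_source": { "knn": [ ... ] } }
      ]
    }
  }
}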

What is your host/environment?

macOS


yuye-aws commented 2 weeks ago

Neural search with explain is not working. I could not find a workaround.

martin-gaievski commented 2 weeks ago

@yuye-aws Inner hits are not supported in the hybrid query. There is a feature request for this (https://github.com/opensearch-project/neural-search/issues/718), but at the moment there is no path forward.

yuye-aws commented 2 weeks ago

I'm not using a hybrid query, just a plain neural query.

yuye-aws commented 2 weeks ago

Are both features not supported due to the same blocking issue?

martin-gaievski commented 2 weeks ago

Sorry, my bad. The neural query is different; I'm not sure why nested doesn't work. In the neural code we delegate execution to the knn query, so you may want to check how it's done in knn. An easy test would be to try whether a plain knn query supports the "nested" clause.

yuye-aws commented 2 weeks ago

> An easy test would be to try whether a plain knn query supports the "nested" clause.

Already tried in my fifth step.

martin-gaievski commented 2 weeks ago

> An easy test would be to try whether a plain knn query supports the "nested" clause.
>
> Already tried in my fifth step.

In step 5 you do have a neural query. I mean the knn query, something like the following example, but with nested:

"query": {
        "knn": {
            "embedding_field": {
                "vector": [
                    5.0,
                    4.0,
                    ....
                    3.8
                ],
                "k": 12
            }
        }
    }
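For the index in this issue, a nested version of that test might look like the following sketch (the query vector is elided and would have to be computed client-side with the same embedding model; k and score_mode are arbitrary choices here):

GET testindex/_search
{
  "query": {
    "nested": {
      "path": "passage_chunk_embedding",
      "score_mode": "max",
      "query": {
        "knn": {
          "passage_chunk_embedding.knn": {
            "vector": [ ... ],
            "k": 12
          }
        }
      },
      "inner_hits": {}
    }
  }
}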
martin-gaievski commented 2 weeks ago

@yuye-aws I found this change in knn: https://github.com/opensearch-project/k-NN/pull/1182. The essence of it is that in the case of nested documents, only the nested document that gave the max score is returned, and the others are dropped. This became the new default behavior, replacing the old one where all nested docs (meaning all inner hits) were returned. The neural query inherits this from knn.

yuye-aws commented 2 weeks ago

> in the case of nested documents, only the nested document that gave the max score is returned, and the others are dropped. This became the new default behavior, replacing the old one where all nested docs (meaning all inner hits) were returned.

This does not make sense, because score_mode can also be avg, in which case every chunk's score has to be computed and we expect to see all of them.

yuye-aws commented 2 weeks ago

> The neural query inherits this from knn.

Shall we make a PR to the knn repo? After all, the nested k-NN query also needs the avg score mode.

heemin32 commented 2 weeks ago

@yuye-aws Please add your use case, and any suggestions you have regarding avg score mode support in knn, to https://github.com/opensearch-project/k-NN/issues/1743.

yuye-aws commented 2 weeks ago

Replied in https://github.com/opensearch-project/k-NN/issues/1743#issuecomment-2347925588. Also, resolving this issue would help resolve a user issue: https://github.com/opensearch-project/ml-commons/issues/2612. I was considering implementing a new search response processor to retrieve the most relevant chunks, but that is unfortunately blocked by the current issue: https://github.com/opensearch-project/ml-commons/issues/2612#issuecomment-2343152694.

hagen6835 commented 3 days ago

Would love this!