opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] Aggregations for Hybrid Queries Return Incorrect Results or Fail #11379

Open jonwiggins opened 7 months ago

jonwiggins commented 7 months ago

**Describe the bug**
Running an aggregation on the results of a hybrid query either 1) raises an error or 2) returns empty aggregation buckets.

**To Reproduce**
Steps to reproduce the behavior:

  1. Create an index and hybrid query pipeline
    
    PUT /test-nlp-index
    {
      "settings": {
        "index.knn": false
      },
      "mappings": {
        "properties": {
          "vector": {
            "type": "knn_vector",
            "dimension": 3
          },
          "message": {"type": "text"},
          "number": {"type": "integer"}
        }
      }
    }

  Response:

    {
      "acknowledged": true,
      "shards_acknowledged": true,
      "index": "test-nlp-index"
    }

    PUT /_search/pipeline/test-nlp-search-pipeline
    {
      "description": "Test Post processor for hybrid search",
      "phase_results_processors": [
        {
          "normalization-processor": {
            "normalization": {
              "technique": "l2"
            },
            "combination": {
              "technique": "harmonic_mean",
              "parameters": {
                "weights": [0.5, 0.5]
              }
            }
          }
        }
      ]
    }

  Response:

    {
      "acknowledged": true
    }
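For readers unfamiliar with the two techniques named in the pipeline, here is a minimal sketch of L2 normalization and a weighted harmonic mean. It uses the textbook definitions; the plugin's actual implementation may differ. In particular, skipping zero scores in the harmonic mean is an assumption, and the input scores are hypothetical.

```python
import math


def l2_normalize(scores):
    """Divide each score by the L2 norm of the whole score list."""
    norm = math.sqrt(sum(s * s for s in scores))
    if norm == 0:
        return [0.0 for _ in scores]
    return [s / norm for s in scores]


def weighted_harmonic_mean(scores, weights):
    """Weighted harmonic mean of per-sub-query scores for one document.

    Assumption: sub-queries with a score of 0 (no match) are skipped,
    since the harmonic mean is undefined for zero values.
    """
    pairs = [(s, w) for s, w in zip(scores, weights) if s > 0]
    if not pairs:
        return 0.0
    return sum(w for _, w in pairs) / sum(w / s for s, w in pairs)


# Hypothetical raw scores for three documents from one sub-query.
raw_scores = [3.0, 2.0, 1.0]
print([round(s, 4) for s in l2_normalize(raw_scores)])

# Hypothetical normalized scores of one document from two sub-queries,
# combined with the pipeline's weights [0.5, 0.5].
print(round(weighted_harmonic_mean([0.8, 0.4], [0.5, 0.5]), 4))
```

The values printed for the 3:2:1 input are numerically close to the `_score` values in the step 5 response. This is only an observation and does not by itself establish how the plugin computed them.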

2. Insert some docs

    PUT /test-nlp-index/_doc/1
    {
      "message": "one two three",
      "number": 1,
      "vector": [0.1, 0.2, 0.3],
      "created_at": "2023-10-11"
    }

    PUT /test-nlp-index/_doc/2
    {
      "message": "two three four",
      "number": 2,
      "vector": [0.2, 0.3, 0.4],
      "created_at": "2023-10-12"
    }

    PUT /test-nlp-index/_doc/3
    {
      "message": "three four five",
      "number": 3,
      "vector": [0.3, 0.4, 0.5],
      "created_at": "2023-10-13"
    }

3. Query with aggregation based on a day via script score:

    GET /test-nlp-index/_search?search_pipeline=test-nlp-search-pipeline
    {
      "size": 0,
      "query": {
        "script_score": {
          "min_score": 1,
          "query": {
            "bool": {
              "filter": [
                { "range": { "number": { "gt": 0, "lt": 4 } } }
              ]
            }
          },
          "script": {
            "source": "knn_score",
            "lang": "knn",
            "params": {
              "field": "vector",
              "query_value": [-0.1, -0.2, 0.3],
              "space_type": "cosinesimil"
            }
          }
        }
      },
      "aggs": {
        "count_per_day": {
          "date_histogram": {
            "format": "yyyy-MM-dd",
            "field": "created_at",
            "interval": "day"
          }
        }
      }
    }

Response (the buckets are populated as expected):

    {
      "took": 15,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": { "value": 3, "relation": "eq" },
        "max_score": null,
        "hits": []
      },
      "aggregations": {
        "count_per_day": {
          "buckets": [
            { "key_as_string": "2023-10-11", "key": 1696982400000, "doc_count": 1 },
            { "key_as_string": "2023-10-12", "key": 1697068800000, "doc_count": 1 },
            { "key_as_string": "2023-10-13", "key": 1697155200000, "doc_count": 1 }
          ]
        }
      }
    }

4. Attempt to query with aggregation via hybrid query:

    GET /test-nlp-index/_search?search_pipeline=test-nlp-search-pipeline
    {
      "size": 0,
      "query": {
        "hybrid": {
          "queries": [
            {
              "bool": {
                "must": {
                  "match": { "message": { "query": "one two three" } }
                }
              }
            },
            {
              "script_score": {
                "min_score": 1,
                "query": {
                  "bool": {
                    "filter": [
                      { "range": { "number": { "gt": 0, "lt": 4 } } }
                    ]
                  }
                },
                "script": {
                  "source": "knn_score",
                  "lang": "knn",
                  "params": {
                    "field": "vector",
                    "query_value": [-0.1, -0.2, 0.3],
                    "space_type": "cosinesimil"
                  }
                }
              }
            }
          ]
        }
      },
      "aggs": {
        "count_per_day": {
          "date_histogram": {
            "format": "yyyy-MM-dd",
            "field": "created_at",
            "interval": "day"
          }
        }
      }
    }

Response (the request fails with a `null_pointer_exception`):

    {
      "error": {
        "root_cause": [
          {
            "type": "null_pointer_exception",
            "reason": "Cannot read field \"topDocs\" because \"topDocs\" is null"
          }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
          {
            "shard": 0,
            "index": "test-nlp-index",
            "node": "Lo9EC8U0TtK6aYca8HeKRQ",
            "reason": {
              "type": "null_pointer_exception",
              "reason": "Cannot read field \"topDocs\" because \"topDocs\" is null"
            }
          }
        ],
        "caused_by": {
          "type": "null_pointer_exception",
          "reason": "Cannot read field \"topDocs\" because \"topDocs\" is null",
          "caused_by": {
            "type": "null_pointer_exception",
            "reason": "Cannot read field \"topDocs\" because \"topDocs\" is null"
          }
        }
      },
      "status": 500
    }

5. Attempt the above query again with `size` > 0

    GET /test-nlp-index/_search?search_pipeline=test-nlp-search-pipeline
    {
      "size": 5,
      "query": {
        "hybrid": {
          "queries": [
            {
              "bool": {
                "must": {
                  "match": { "message": { "query": "one two three" } }
                }
              }
            },
            {
              "script_score": {
                "min_score": 1,
                "query": {
                  "bool": {
                    "filter": [
                      { "range": { "number": { "gt": 0, "lt": 4 } } }
                    ]
                  }
                },
                "script": {
                  "source": "knn_score",
                  "lang": "knn",
                  "params": {
                    "field": "vector",
                    "query_value": [-0.1, -0.2, 0.3],
                    "space_type": "cosinesimil"
                  }
                }
              }
            }
          ]
        }
      },
      "aggs": {
        "count_per_day": {
          "date_histogram": {
            "format": "yyyy-MM-dd",
            "field": "created_at",
            "interval": "day"
          }
        }
      }
    }

Response (hits are returned, but the aggregation buckets are empty):

    {
      "took": 7,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": { "value": 3, "relation": "eq" },
        "max_score": 0.80178374,
        "hits": [
          {
            "_index": "test-nlp-index",
            "_id": "1",
            "_score": 0.80178374,
            "_source": {
              "message": "one two three",
              "number": 1,
              "vector": [0.1, 0.2, 0.3],
              "created_at": "2023-10-11"
            }
          },
          {
            "_index": "test-nlp-index",
            "_id": "2",
            "_score": 0.5345225,
            "_source": {
              "message": "two three four",
              "number": 2,
              "vector": [0.2, 0.3, 0.4],
              "created_at": "2023-10-12"
            }
          },
          {
            "_index": "test-nlp-index",
            "_id": "3",
            "_score": 0.26726124,
            "_source": {
              "message": "three four five",
              "number": 3,
              "vector": [0.3, 0.4, 0.5],
              "created_at": "2023-10-13"
            }
          }
        ]
      },
      "aggregations": {
        "count_per_day": {
          "buckets": []
        }
      }
    }
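Steps 4 and 5 send the same hybrid query and differ only in `size`. A minimal sketch of a body builder that makes this explicit is below. The function name `build_hybrid_search_body` is hypothetical, and sending the request is left to whatever client is in use.

```python
import json


def build_hybrid_search_body(size):
    """Build the hybrid query + date_histogram body from steps 4 and 5.

    Only `size` varies between the two steps; everything else is copied
    from the requests in this report.
    """
    script_score_query = {
        "script_score": {
            "min_score": 1,
            "query": {
                "bool": {
                    "filter": [{"range": {"number": {"gt": 0, "lt": 4}}}]
                }
            },
            "script": {
                "source": "knn_score",
                "lang": "knn",
                "params": {
                    "field": "vector",
                    "query_value": [-0.1, -0.2, 0.3],
                    "space_type": "cosinesimil",
                },
            },
        }
    }
    match_query = {
        "bool": {"must": {"match": {"message": {"query": "one two three"}}}}
    }
    return {
        "size": size,
        "query": {"hybrid": {"queries": [match_query, script_score_query]}},
        "aggs": {
            "count_per_day": {
                "date_histogram": {
                    "format": "yyyy-MM-dd",
                    "field": "created_at",
                    "interval": "day",
                }
            }
        },
    }


step4_body = build_hybrid_search_body(size=0)  # fails with the NPE in this report
step5_body = build_hybrid_search_body(size=5)  # returns hits but empty buckets
print(json.dumps(step4_body, indent=2))
```

Each body is meant to be sent to `/test-nlp-index/_search?search_pipeline=test-nlp-search-pipeline`, as in the requests above.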



**Expected behavior**
In step 4 above, the query should return without an error and include the aggregation buckets.
In step 5 above, the query should return populated aggregation buckets, as the plain `script_score` query in step 3 does.
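To check for either failure mode automatically, a small helper along these lines could classify a parsed search response. This is only a sketch: the function name and the labels it returns are hypothetical, and the sample responses are trimmed from the ones in this report.

```python
def classify_search_response(response, agg_name="count_per_day"):
    """Classify a parsed _search response against the expected behavior.

    Returns "error" if the response carries an error object,
    "empty_buckets" if documents matched but the aggregation has no
    buckets, and "ok" otherwise.
    """
    if "error" in response:
        return "error"
    total = response.get("hits", {}).get("total", {}).get("value", 0)
    buckets = (
        response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    )
    if total > 0 and not buckets:
        return "empty_buckets"
    return "ok"


# Trimmed versions of the responses from steps 3, 4, and 5.
step3 = {
    "hits": {"total": {"value": 3, "relation": "eq"}, "hits": []},
    "aggregations": {
        "count_per_day": {
            "buckets": [
                {"key_as_string": "2023-10-11", "doc_count": 1},
                {"key_as_string": "2023-10-12", "doc_count": 1},
                {"key_as_string": "2023-10-13", "doc_count": 1},
            ]
        }
    },
}
step4 = {"error": {"type": "search_phase_execution_exception"}, "status": 500}
step5 = {
    "hits": {"total": {"value": 3, "relation": "eq"}},
    "aggregations": {"count_per_day": {"buckets": []}},
}

for name, resp in [("step3", step3), ("step4", step4), ("step5", step5)]:
    print(name, classify_search_response(resp))
```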

**Plugins**
n/a

**Host/Environment (please complete the following information):**
 - OS: AWS
 - Version: OpenSearch_2_11_R20231113-P1

peternied commented 7 months ago

Thanks for filing this issue.

qmauret commented 3 months ago

I ran into the same problem. Any updates?