quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

Double sorting with aggregation not working #5120

Open djklim87 opened 3 months ago

djklim87 commented 3 months ago

Describe the bug

When we run a terms aggregation sorted by two fields, we see two bugs: the buckets come back in the wrong order, and repeating the same request can return different results.

Steps to reproduce (if applicable)

  1. Download dataset and index config from https://dev2.manticoresearch.com/index-settings-and-data.zip
  2. Run Quickwit in Docker quickwit/quickwit:0.8.1
  3. Create index (config provided in attached archive):
    
    export HOST='http://localhost:7280'

    curl -s -XPOST "${HOST}/api/v1/indexes" \
      --header "content-type: application/yaml" \
      --data-binary @./index-config.yaml

  4. Upload data (the dataset is pretty big, so we split it into chunks):

    split -l 10000 ./data.jsonl ./data_splitted.

    echo "Starting loading"
    for f in ./data_splitted.*; do
      echo "Upload chunk $f"
      curl -s -XPOST "${HOST}/api/v1/hn_small/ingest?commit=force" --data-binary @"$f"
      rm "$f"
    done
    echo "Finished"

  5. Perform query:

    curl --location "${HOST}/api/v1/hn_small/search" \
      --header 'Content-Type: application/json' \
      --data '{"query":"*","max_hits":0,"aggs":{"comment_ranking_avg":{"terms":{"field":"comment_ranking","size":20,"order":{"avg_field":"desc","_key":"desc"}},"aggs":{"avg_field":{"avg":{"field":"author_comment_count"}}}}}}'

  6. We got results with the wrong sorting:

    {
      "num_hits": 1165439,
      "hits": [],
      "elapsed_time_micros": 6665,
      "errors": [],
      "aggregations": {
        "comment_ranking_avg": {
          "buckets": [
            { "avg_field": { "value": 3504.0 }, "doc_count": 1, "key": 928.0 },  # Should be 2nd
            { "avg_field": { "value": 3504.0 }, "doc_count": 1, "key": 961.0 },  # Should be 1st
            { "avg_field": { "value": 3504.0 }, "doc_count": 1, "key": 730.0 },
            .....

  7. If you repeat the request several times, it can return different results for the same query:

    {
      "num_hits": 1165439,
      "hits": [],
      "elapsed_time_micros": 9610,
      "errors": [],
      "aggregations": {
        "comment_ranking_avg": {
          "buckets": [
            { "avg_field": { "value": 64.0 }, "doc_count": 1, "key": 1305.0 },
            { "avg_field": { "value": 117.0 }, "doc_count": 1, "key": 1296.0 },
            { "avg_field": { "value": 40.0 }, "doc_count": 1, "key": 1289.0 },
            { "avg_field": { "value": 87.0 }, "doc_count": 1, "key": 1287.0 },
            ......


PS: Sometimes it returns results without grouping. In that case you should **reindex your dataset**.

    "buckets": [
      { "avg_field": { "value": 3504.0 }, "doc_count": 1, "key": 961.0 },
      { "avg_field": { "value": 3080.0 }, "doc_count": 1, "key": 980.0 },
      { "avg_field": { "value": 3077.0 }, "doc_count": 1, "key": 1176.0 },


So generally we can get 3 different results for one query.
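To make the non-determinism easier to confirm, the same request can be repeated and the distinct bucket orders counted. The following is only a sketch: `distinct_bucket_orders` and `run_query` are hypothetical names, and `run_query` stands in for whatever performs the HTTP search call and returns the parsed JSON response. A stub is used here so the snippet runs without a Quickwit instance.

```python
from itertools import cycle


def distinct_bucket_orders(run_query, agg_name, attempts=5):
    """Call run_query() `attempts` times and collect each distinct bucket key order."""
    seen = []
    for _ in range(attempts):
        response = run_query()
        buckets = response["aggregations"][agg_name]["buckets"]
        keys = [bucket["key"] for bucket in buckets]
        if keys not in seen:
            seen.append(keys)
    return seen


# Stub standing in for the real search call (an assumption for illustration):
# it alternates between two shortened responses like the ones shown above.
_responses = cycle([
    {"aggregations": {"comment_ranking_avg": {"buckets": [{"key": 928.0}, {"key": 961.0}]}}},
    {"aggregations": {"comment_ranking_avg": {"buckets": [{"key": 1305.0}, {"key": 1296.0}]}}},
])

orders = distinct_bucket_orders(lambda: next(_responses), "comment_ranking_avg", attempts=4)
print(len(orders))  # more than 1 means the result is not stable across calls
```

With a real `run_query`, a result of length 1 would indicate a stable ordering for that number of attempts.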

PS: The Elasticsearch-compatible URL shows the same behaviour.

**Expected behavior**
It should return a result like the one below, ordered by `avg_field` descending and then by `_key` descending:

    {
      "num_hits": 1165439,
      "hits": [],
      "elapsed_time_micros": 6665,
      "errors": [],
      "aggregations": {
        "comment_ranking_avg": {
          "buckets": [
            { "avg_field": { "value": 3504.0 }, "doc_count": 1, "key": 961.0 },
            { "avg_field": { "value": 3504.0 }, "doc_count": 1, "key": 928.0 },
            { "avg_field": { "value": 3504.0 }, "doc_count": 1, "key": 730.0 },
            .....
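The expected ordering amounts to a two-level comparison: `avg_field` value descending, with ties broken by `key` descending. A minimal Python sketch of that comparison, using the three buckets from this report (`sort_buckets` is a hypothetical helper, not part of Quickwit):

```python
def sort_buckets(buckets):
    """Sort by avg_field.value descending, then by key descending."""
    return sorted(
        buckets,
        key=lambda b: (b["avg_field"]["value"], b["key"]),
        reverse=True,
    )


# Buckets in the order actually returned in step 6.
buckets = [
    {"avg_field": {"value": 3504.0}, "doc_count": 1, "key": 928.0},
    {"avg_field": {"value": 3504.0}, "doc_count": 1, "key": 961.0},
    {"avg_field": {"value": 3504.0}, "doc_count": 1, "key": 730.0},
]

ordered_keys = [b["key"] for b in sort_buckets(buckets)]
print(ordered_keys)  # [961.0, 928.0, 730.0]
```

Applied client-side, this only reorders the buckets already returned; it cannot recover buckets that were cut off by `size` on the server.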


**Configuration:**
Please provide:

1. Output of `quickwit --version`

Quickwit 0.8.1 (aarch64-unknown-linux-gnu 2024-03-29T14:09:41Z e6c5396)

2. The index_config.yaml 

(Provided in the attached archive)

PSeitz commented 3 months ago
{
  "query": "*",
  "max_hits": 0,
  "aggs": {
    "comment_ranking_avg": {
      "terms": {
        "field": "comment_ranking",
        "size": 20,
        "order": {
          "avg_field": "desc",
          "_key": "desc"
        }
      },
      "aggs": {
        "avg_field": {
          "avg": {
            "field": "author_comment_count"
          }
        }
      }
    }
  }
}

This is not a correct way to define the order. It should be:

"order": [ { "avg_field": "desc" }, { "_key":"desc" } ] 

But this is not supported yet; currently only sorting by a single field is supported.

djklim87 commented 3 months ago

The suggested order syntax does not work either, since it is not implemented yet:

curl --location 'http://127.0.0.1:7280/api/v1/hn_small/search' \
--header 'Content-Type: application/json' \
--data '{
    "query": "*",
    "max_hits": 0,
    "aggs": {
        "comment_ranking_avg": {
            "terms": {
                "field": "comment_ranking",
                "size": 20,
                "order": [
                    {
                        "avg_field": "desc"
                    },
                    {
                        "_key": "desc"
                    }
                ]
            },
            "aggs": {
                "avg_field": {
                    "avg": {
                        "field": "author_comment_count"
                    }
                }
            }
        }
    }
}'
{
    "message": "invalid aggregation request: invalid type: sequence, expected a map at line 1 column 180"
}

With an order by a single key, it works fine and gives the same results on each call.

You should probably note somewhere in the docs that only one sort argument is currently supported.
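Until that is documented, a client can guard against the unsupported forms before sending the request. This is a sketch based only on the behaviour reported in this thread for 0.8.1 (a map with one key works, the list form is rejected with "expected a map", and a two-key map gives unstable results); `check_terms_order` is a hypothetical helper.

```python
def check_terms_order(order):
    """Return `order` unchanged if it is a map with exactly one sort key."""
    if isinstance(order, list):
        raise ValueError("list form of 'order' is rejected; use a single-key map")
    if not isinstance(order, dict) or len(order) != 1:
        raise ValueError("'order' must be a map with exactly one sort key")
    return order


single = check_terms_order({"avg_field": "desc"})
print(single)  # {'avg_field': 'desc'}

try:
    check_terms_order({"avg_field": "desc", "_key": "desc"})
    two_keys_rejected = False
except ValueError:
    two_keys_rejected = True
print(two_keys_rejected)  # True
```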

fmassot commented 2 months ago

@PSeitz will the issue be closed with the merged PR https://github.com/quickwit-oss/quickwit/pull/5121 ?

PSeitz commented 2 months ago

There's also https://github.com/quickwit-oss/tantivy/pull/2451

But it only covers error handling; it does not implement ordering by multiple fields.