vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

Unexpected outcome when using a grouping query on double typed field #24892

Open PeriLara opened 1 year ago

PeriLara commented 1 year ago

Hi!

I was trying to get distribution information for a double-typed field called quality_spamness, defined as follows:

        field quality_spamness type double {
            indexing: summary | attribute
            attribute: fast-access
        }

The Vespa environment I'm working on:

OS: Debian
Infra: self hosted
Vespa version: 7.581.33

But I'm getting some weird results: some hits with quality_spamness values < 1 are placed in the bucket between 1 and 2.

Here is the YQL query:

"select quality_spamness  | all(group(predefined(quality_spamness, bucket[0,1>, bucket[1,2>,  bucket[2,inf>)) order(-count()) each(max(2) each(output(summary()))));"

With this query I want to get 2 hits per group.

Here is what I get:

"children": [
  {
    "id": "group:root:0",
    "relevance": 1.0,
    "continuation": {
      "this": ""
    },
    "children": [
      {
        "id": "grouplist:predefined(quality_spamness, bucket[0, 1>, bucket[1, 2>, bucket[2, inf>)",
          "relevance": 1.0,
          "label": "predefined(quality_spamness, bucket[0, 1>, bucket[1, 2>, bucket[2, inf>)",
            "children": [
              {
                "id": "group:long_bucket:1:2",
                "relevance": 1.0,
                "limits": {
                  "from": "1",
                  "to": "2"
                },
                "children": [
                  {
                    "id": "hitlist:hits",
                    "relevance": 1.0,
                    "label": "hits",
                    "continuation": {
                      "next": "BKAAAAABGBEBC"
                    },
                    "children": [
                      {
                        "id": "index:mercury/0/5d278a68949c0ef92fa4ea1b",
                        "relevance": 0.0,
                        "source": "mercury",
                        "fields": {
                          "quality_spamness": 0.9576748013496399
                        }
                      },
                      {
                        "id": "index:mercury/0/00b000548ea9e239476a5eaf",
                        "relevance": 0.0,
                        "source": "mercury",
                        "fields": {
                          "quality_spamness": 0.8394812345504761
                        }
                      }
                    ]
                  }
                ]
              },
              {
                "id": "group:long_bucket:0:1",
                "relevance": 0.6666666666666666,
                "limits": {
                  "from": "0",
                  "to": "1"
                },
                "children": [
                  {
                    "id": "hitlist:hits",
                    "relevance": 1.0,
                    "label": "hits",
                    "continuation": {
                      "next": "BKAAABCABGBEBC"
                    },
                    "children": [
                      {
                        "id": "index:mercury/0/210600204378965182e72dda",
                        "relevance": 0.0,
                        "source": "mercury",
                        "fields": {
                          "quality_spamness": 0.0
                        }
                      },
                      {
                        "id": "index:mercury/0/210600e299b49c94df7b4248",
                        "relevance": 0.0,
                        "source": "mercury",
                        "fields": {
                          "quality_spamness": 0.0
                        }
                      }
                    ]
                  }
                ]
              },

As we can see, for the group [1;2[ the quality_spamness values of the hits do not correspond to the group they are assigned to. For the group [0;1[ it seems that values in between are not taken into account, only values equal to 0.

If this is not clear, I can provide more information, just let me know!

Thanks a lot!

jobergum commented 1 year ago

This looks like a bug @bjorncs?

bjorncs commented 1 year ago

Yes, this is clearly a bug.

bjorncs commented 1 year ago

@PeriLara Try to specify floating point values as limits in the predefined expression as a workaround: predefined(quality_spamness, bucket[0.0,1.0>, bucket[1.0,2.0>, bucket[2.0,inf>)

bjorncs commented 1 year ago

We have now identified the underlying problem.

A predefined bucket operator takes its type from the lower/upper bound values (double/long/raw/string). Each value evaluated from the predefined expression is converted to the bucket type before being assigned a bucket. In the above scenario the bounds are integers, so the bucket type is long, and quality_spamness is the expression being bucketed. Its values are rounded to the nearest integer, e.g. 0.9576748013496399 is rounded up to 1.

The YQL query will therefore use the following mapping:

Using decimal bounds in bucket definitions gives the expected behaviour for quality_spamness.
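
To illustrate the conversion described above, here is a minimal Python sketch. It is not Vespa's implementation: the helper names `to_long` and `assign_bucket` are hypothetical, and round-half-up is an assumed rounding mode chosen to match the "rounded to the nearest integer" description.

```python
import math

def to_long(value):
    # Assumed rounding mode: round half up to the nearest integer.
    return math.floor(value + 0.5)

def assign_bucket(value, buckets):
    """Return the first [lo, hi) bucket that contains value.

    The bucket type is taken from the bounds: if every bound is an
    integer (or inf), the value is converted to long before matching.
    """
    is_long = all(
        isinstance(b, int) or math.isinf(b)
        for pair in buckets
        for b in pair
    )
    converted = to_long(value) if is_long else float(value)
    for lo, hi in buckets:
        if lo <= converted < hi:
            return (lo, hi)
    return None

# Integer bounds, as in the original query.
LONG_BUCKETS = [(0, 1), (1, 2), (2, math.inf)]
# Decimal bounds, as in the workaround.
DOUBLE_BUCKETS = [(0.0, 1.0), (1.0, 2.0), (2.0, math.inf)]

for v in (0.9576748013496399, 0.8394812345504761, 0.0):
    print(v, assign_bucket(v, LONG_BUCKETS), assign_bucket(v, DOUBLE_BUCKETS))
```

With integer bounds, both 0.9576748013496399 and 0.8394812345504761 are converted to 1 and land in the 1 to 2 bucket, while with decimal bounds they stay in the 0.0 to 1.0 bucket.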

The grouping query parser does not use type information when parsing the grouping expression into the intermediate AST. Ideally it should deduce the type of any expression given to predefined and select the correct bucket type based on that. An easier approach is to only handle expressions that consist of a single attribute.
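
The "easier approach" could look roughly like the following Python sketch. Everything here is hypothetical and only illustrates the idea: `ATTRIBUTE_TYPES` stands in for a schema lookup, and `choose_bucket_type` is not a function in Vespa.

```python
# Hypothetical schema lookup: attribute name -> declared type.
ATTRIBUTE_TYPES = {"quality_spamness": "double"}

def choose_bucket_type(expression, bound_values):
    """Pick the bucket type for a predefined() expression.

    If the expression is a bare attribute of type double, use double
    buckets even when the bounds were written as integers. Otherwise
    fall back to the type implied by the bounds.
    """
    bounds_type = (
        "double" if any(isinstance(b, float) for b in bound_values) else "long"
    )
    if ATTRIBUTE_TYPES.get(expression) == "double":
        return "double"
    return bounds_type

print(choose_bucket_type("quality_spamness", [0, 1, 2]))
print(choose_bucket_type("some_other_expression", [0, 1, 2]))
```

Expressions that are not a single known attribute keep the current behaviour, which is why this is simpler than full type deduction.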

PeriLara commented 1 year ago

Yes it seems to work this way, at least for the documents with quality_spamness < 1.0 (strictly)!

But isn't it counterintuitive since bucketing on float is not allowed? I mean, something like predefined(quality_spamness, bucket[0,0.5>, bucket[0.5,1.0>, bucket[1.0,inf>) returns a 400 Bad request error with this message: "message": "Could not create query from YQL: Bucket type mismatch, expected 'LongValue' got 'DoubleValue'.",

bjorncs commented 1 year ago

@PeriLara Make sure to specify decimal for all bounds (bucket[0,0.5> => bucket[0.0,0.5>).
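
For clients that build the grouping expression programmatically, one way to avoid mixing integer and decimal bounds is to format every bound as a decimal. The following is a minimal Python sketch; `format_bound` and `predefined` are hypothetical helper names, not part of any Vespa client library.

```python
import math

def format_bound(x):
    # Always emit a decimal literal so that all bounds share one type.
    x = float(x)
    if math.isinf(x):
        return "-inf" if x < 0 else "inf"
    return repr(x)

def predefined(field, edges):
    """Build a predefined() expression from consecutive bucket edges."""
    buckets = ", ".join(
        f"bucket[{format_bound(lo)},{format_bound(hi)}>"
        for lo, hi in zip(edges, edges[1:])
    )
    return f"predefined({field}, {buckets})"

expr = predefined("quality_spamness", [0, 0.5, 1, math.inf])
print(expr)
# predefined(quality_spamness, bucket[0.0,0.5>, bucket[0.5,1.0>, bucket[1.0,inf>)
```

Because every edge passes through `float`, an edge written as `0` or `1` is emitted as `0.0` or `1.0`, which matches the workaround above.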

PeriLara commented 1 year ago

> Make sure to specify decimal for all bounds (bucket[0,0.5> => bucket[0.0,0.5>).

Yes it works ! Thanks ! :)

bjorncs commented 1 year ago

The grouping query documentation on predefined will be improved with https://github.com/vespa-engine/documentation/pull/2420.

jobergum commented 1 year ago

Doesn't the documentation update primarily address this? I don't think we can do anything with this in Vespa's behavior.