opensearch-project / sql

Query your data using familiar SQL or intuitive Piped Processing Language (PPL)
https://opensearch.org/docs/latest/search-plugins/sql/index/
Apache License 2.0
121 stars 140 forks source link

[BUG] MAX/MIN functions does not work properly with array values #3138

Open normanj-bitquill opened 3 weeks ago

normanj-bitquill commented 3 weeks ago

What is the bug? If MAX or MIN are applied to a field with array values, then the MAX or MIN of any element in any of the array values is returned.

Consider an index with the following data:

{"x": 1, "y": [1, 2]}
{"x": 2, "y": [3, 4]}
{"x": 3, "y": [1, 5]}
{"x": 4, "y": [1, 2]}
{"x": 5, "y": [2, 3]}

and this query:

SELECT MAX(y) FROM test3;
5

or this query:

SELECT MIN(y) FROM test3;
1

For MAX, the expectd result is [3, 4] and for MIN, the expected result is [1, 2]. The comparison should be performed element by element and preserve the entire array value.

How can one reproduce the bug? Steps to reproduce the behavior:

  1. Create a new index and load the data above
  2. Run the query above

What is the expected behavior? Use array comparisons to compare the array values and return the MAX or MIN array.

What is your host/environment?

Do you have any screenshots? N/A

Do you have any additional context? Issue #1300 had a change recently merged in that allows array values to be used in query evaluation and in the result set.

normanj-bitquill commented 3 weeks ago

This appears to be due to how the OpenSearch engine calculates the min and max for arrays values.

POST test3/_search
{
  "size": 0,
  "aggs": {
    "y_min": {
      "min": {
        "field": "y"
      }
    }
  }
}
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "y_min": {
      "value": 1
    }
  }
}
POST test3/_search
{
  "size": 0,
  "aggs": {
    "y_max": {
      "max": {
        "field": "y"
      }
    }
  }
}
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "y_max": {
      "value": 5
    }
  }
}
normanj-bitquill commented 3 weeks ago

For reference, this is how PostgreSQL behaves:

SELECT * FROM with_array;
 x |   y
---+-------
 1 | {1,2}
 2 | {3,4}
 3 | {1,5}
 4 | {1,2}
(4 rows)
SELECT MIN(y) FROM with_array;
  min
-------
 {1,2}
(1 row)
SELECT MAX(y) FROM with_array;
  max
-------
 {3,4}
(1 row)
penghuo commented 3 weeks ago

Thanks for rasing issue. OpenSearch/Lucene does not support ARRAY data type. Lucene allows adding multiple values for the same field name. for example. when calcualte min/max of field y, OpenSeach read all the value of y, and calcuate metrics, the result is min(y)=1, max(y)=5.

field,value
y,1
y,2
y,3
y,4
y,1
y,5
y,1
y,2
y,2
y,3
penghuo commented 3 weeks ago

Related issue in opensearch-core. https://github.com/opensearch-project/OpenSearch/issues/16420

penghuo commented 3 weeks ago

One solution is if array_field is used in aggregation, we should do post-processing, instead of rewrite as DSL aggregation query.

normanj-bitquill commented 3 weeks ago

@penghuo I'm not sure the SQL plugin can easily know that the field is an array field. By default, the mapping will only contain information about the element types.

GET my_index/mapping
{
  "properties": {
    "id": {
      "type": "long"
    },
    "array_field": {
      "type": "long"
    }
  }
}

If the user adds something to the mapping, we could know that it is an array. https://trino.io/docs/current/connector/elasticsearch.html#array-types

Is there anything similar currently in OpenSearch?

penghuo commented 3 weeks ago

@normanj-bitquill agree, https://github.com/opensearch-project/OpenSearch/issues/16420 discuss similar approach.

andrross commented 1 week ago

[Catch All Triage - 1, 2, 3, 4, 5]