opensearch-project / sql

Query your data using familiar SQL or intuitive Piped Processing Language (PPL)
https://opensearch.org/docs/latest/search-plugins/sql/index/
Apache License 2.0
119 stars 138 forks source link

[RFC] Make aggregation statement compilation robust #2767

Open LantaoJin opened 4 months ago

LantaoJin commented 4 months ago

Is your feature request related to a problem? Aggregation statement, query with aggregate function, group-by clause that we've supported, grouping set, cube that we haven't supported, has many restrictions based on SQL standard. OpenSearch SQL adapts MySQL grammar as it initial grammar, but the query compilation check should be done in query resolution/analyzing. Current query compilation check is weak, especially for aggregation statement.

Here is the result comparison of query compilation check between OpenSearch SQL, Spark, Postgres and Oracle.

POST _bulk
{ "index" : { "_index" : "test_bulk1", "_id" : "1" } }
{ "name": "Clark", "query_id": 1 }
{ "index" : { "_index" : "test_bulk1", "_id" : "2" } }
{ "name": "Dave", "query_id": 2 }
{ "index" : { "_index" : "test_bulk1", "_id" : "3" } }
{ "name": "Ava", "query_id": 3 }
name query_id
Clark 1
Dave 2
Ava 3
ID Query
1 SELECT query_id FROM test_bulk1 GROUP BY name
2 SELECT query_id, max(query_id) FROM test_bulk1
3 SELECT query_id FROM test_bulk1 GROUP BY max(query_id)
4 SELECT max(query_id) FROM test_bulk1 GROUP BY 1
5 SELECT max(query_id) FILTER(WHERE query_id) FROM test_bulk1
6 SELECT max(query_id) FILTER(WHERE max(query_id)>10) FROM test_bulk1
7 SELECT * FROM test_bulk1 WHERE ROW_NUMBER() OVER(ORDER BY name) > 0
8 SELECT name, rank() OVER (PARTITION BY a) FROM test_bulk1
ID Description OS-SQL Spark Postgres Oracle
1 column not in group-by clause return null value AnalysisException: The non-aggregating expression "query_id" is based on columns which are not participating in the GROUP BY clause. ERROR: column "query_id" must appear in the GROUP BY clause or be used in an aggregate function ORA-00979: "QUERY_ID": must appear in the GROUP BY clause or be used in an aggregate function
2 group-by is missing Explicit GROUP BY clause is required because expression [query_id] contains non-aggregated column The query does not include a GROUP BY clause. Add GROUP BY or turn it into the window functions using OVER clauses ditto^ ORA-00937: not a single-group group function
3 aggregate functions in group-by Server side error during query execution Aggregate functions are not allowed in GROUP BY, but found max(query_id) ERROR: aggregate functions are not allowed in GROUP BY ORA-00934: group function is not allowed here
4 group-by position refers to aggregate function Server side error during query execution GROUP BY 1 refers to an expression that contains an aggregate function. Aggregate functions are not allowed in GROUP BY ditto^ return 3
5 non boolean filter in aggregate Client side error during query execution: [filter] must not be null FILTER expression is not of type boolean. It cannot be used in an aggregate function ERROR: argument of FILTER must be type boolean, not type integer ORA-00923: FROM keyword not found where expected
6 aggregate in aggregate-filter error Server side error during query execution FILTER expression contains aggregate. It cannot be used in an aggregate function ERROR: aggregate functions are not allowed in FILTER ditto^
7 window function in where-clause Server side error during query execution It is not allowed to use window functions inside WHERE clause ERROR: window functions are not allowed in WHERE ORA-30483: window functions are not allowed here
8 ranking window function misses order clause return Ava,1; Clark,1; Dave,1 Window function rank() requires window to be ordered, please add ORDER BY clause return Ava,1; Clark,1; Dave,1 missing ORDER BY expression in the window specification

We should fix above query compilation error to make aggregation more robust.

What solution would you like? This RFC is to address the weakness of query compilation for aggregation statement, including query with aggregate function, window function, group-by clause etc. We will align with semantics of Spark and PostgresSQL as possible as we can. If the behaviours between Spark and PostgresSQL are different, we will choose the option which one could mitigate the impact or reduces breaking change.

What alternatives have you considered? A clear and concise description of any alternative solutions or features you've considered.

Do you have any additional context? Add any other context or screenshots about the feature request here.

dblock commented 3 months ago

[Catch All Triage - 1, 2]