[RFC] Make aggregation statement compilation robust

Is your feature request related to a problem? Aggregation statement, query with aggregate function, group-by clause that we've supported, grouping set, cube that we haven't supported, has many restrictions based on SQL standard. OpenSearch SQL adapts MySQL grammar as it initial grammar, but the query compilation check should be done in query resolution/analyzing. Current query compilation check is weak, especially for aggregation statement.

Here is the result comparison of query compilation check between OpenSearch SQL, Spark, Postgres and Oracle.

Example data:

POST _bulk
{ "index" : { "_index" : "test_bulk1", "_id" : "1" } }
{ "name": "Clark", "query_id": 1 }
{ "index" : { "_index" : "test_bulk1", "_id" : "2" } }
{ "name": "Dave", "query_id": 2 }
{ "index" : { "_index" : "test_bulk1", "_id" : "3" } }
{ "name": "Ava", "query_id": 3 }

name	query_id
Clark	1
Dave	2
Ava	3

Example queries:

ID	Query
1	SELECT query_id FROM test_bulk1 GROUP BY name
2	SELECT query_id, max(query_id) FROM test_bulk1
3	SELECT query_id FROM test_bulk1 GROUP BY max(query_id)
4	SELECT max(query_id) FROM test_bulk1 GROUP BY 1
5	SELECT max(query_id) FILTER(WHERE query_id) FROM test_bulk1
6	SELECT max(query_id) FILTER(WHERE max(query_id)>10) FROM test_bulk1
7	SELECT * FROM test_bulk1 WHERE ROW_NUMBER() OVER(ORDER BY name) > 0
8	SELECT name, rank() OVER (PARTITION BY a) FROM test_bulk1

Result of query compilation:

ID	Description	OS-SQL	Spark	Postgres	Oracle
1	column not in group-by clause	return null value	AnalysisException: The non-aggregating expression "query_id" is based on columns which are not participating in the GROUP BY clause.	ERROR: column "query_id" must appear in the GROUP BY clause or be used in an aggregate function	ORA-00979: "QUERY_ID": must appear in the GROUP BY clause or be used in an aggregate function
2	group-by is missing	Explicit GROUP BY clause is required because expression [query_id] contains non-aggregated column	The query does not include a GROUP BY clause. Add GROUP BY or turn it into the window functions using OVER clauses	ditto^	ORA-00937: not a single-group group function
3	aggregate functions in group-by	Server side error during query execution	Aggregate functions are not allowed in GROUP BY, but found max(query_id)	ERROR: aggregate functions are not allowed in GROUP BY	ORA-00934: group function is not allowed here
4	group-by position refers to aggregate function	Server side error during query execution	GROUP BY 1 refers to an expression that contains an aggregate function. Aggregate functions are not allowed in GROUP BY	ditto^	return `3`
5	non boolean filter in aggregate	Client side error during query execution: [filter] must not be null	FILTER expression is not of type boolean. It cannot be used in an aggregate function	ERROR: argument of FILTER must be type boolean, not type integer	ORA-00923: FROM keyword not found where expected
6	aggregate in aggregate-filter error	Server side error during query execution	FILTER expression contains aggregate. It cannot be used in an aggregate function	ERROR: aggregate functions are not allowed in FILTER	ditto^
7	window function in where-clause	Server side error during query execution	It is not allowed to use window functions inside WHERE clause	ERROR: window functions are not allowed in WHERE	ORA-30483: window functions are not allowed here
8	ranking window function misses order clause	return `Ava,1; Clark,1; Dave,1`	Window function rank() requires window to be ordered, please add ORDER BY clause	return `Ava,1; Clark,1; Dave,1`	missing ORDER BY expression in the window specification

We should fix above query compilation error to make aggregation more robust.

What solution would you like? This RFC is to address the weakness of query compilation for aggregation statement, including query with aggregate function, window function, group-by clause etc. We will align with semantics of Spark and PostgresSQL as possible as we can. If the behaviours between Spark and PostgresSQL are different, we will choose the option which one could mitigate the impact or reduces breaking change.

What alternatives have you considered? A clear and concise description of any alternative solutions or features you've considered.

Do you have any additional context? Add any other context or screenshots about the feature request here.

opensearch-project / sql

[RFC] Make aggregation statement compilation robust #2767