opensearch-project / sql

Query your data using familiar SQL or intuitive Piped Processing Language (PPL)
https://opensearch.org/docs/latest/search-plugins/sql/index/
Apache License 2.0
110 stars 129 forks source link

[RFC] Support Percentile in PPL #2670

Open LantaoJin opened 1 month ago

LantaoJin commented 1 month ago

Is your feature request related to a problem? This RFC is a part of this issue: https://github.com/opensearch-project/sql/issues/44 Currently, PPL can not answer question with percentile function. For example

"source=opensearch_dashboards_sample_data_ecommerce| stats percentile<50>(taxless_total_price)"

returns Unsupported aggregation function

"{\n  \"error\": {\n    \"reason\": \"Invalid Query\",\n    \"details\": \"Unsupported aggregation function percentile\",\n    \"type\": \"SemanticCheckException\"\n  },\n  \"status\": 400\n}"

SQL query with percentiles function works as below because that it fallbacks to legacy engine which is not support PPL, so does current JOIN syntax.

SELECT percentiles(taxful_total_price) FROM opensearch_dashboards_sample_data_ecommerce

PERCENTILE is a common aggregate function which has a lot of cases needs in Visualization with PPL. Same feature request from community: https://github.com/opendistro-for-elasticsearch/sql/issues/1093

What solution would you like?

Syntax in PPL

Option 1 (defined in current OpenSearchPPLParser.g4, but not implemented in code)

stats percentile<quantile>(aggField)
stats percentile_approx<quantile>(aggField)

Option 2 (more readable and widely used in OLAP engines)

stats percentile(aggField, quantile)
stats percentile_approx(aggField, quantile)

Syntax in SQL

Basic

PERCENTILE(functionArg, quantile)
PERCENTILE_APPROX(functionArg, quantile)

ANSI SQL (experimental)

PERCENTILE_CONT(quantile) WITHIN GROUP [ORDER BY expression [ASC | DESC]]
PERCENTILE_DISC(quantile) WITHIN GROUP [ORDER BY expression [ASC | DESC]]

For example, mainstream database supports percentile_cont and percentile_disc Postgresql https://www.postgresql.org/docs/9.4/functions-aggregate.html RedShift https://docs.aws.amazon.com/redshift/latest/dg/r_PERCENTILE_CONT.html Snowflake https://docs.snowflake.com/en/sql-reference/functions/percentile_disc.html Spark https://issues.apache.org/jira/browse/SPARK-37691

Solution

To align with the current percentiles(aggField) implementation in legacy engine and percentiles agg in OpenSearch core, for percentile we will use t-digest construction algorithm which is an approximate calculation.

For percentile_cont and percentile_disc, we could use org.apache.commons.math3:Percentile instead. For example, setting EstimationType R_1 for percentile_disc and R_7 for percentile_cont. This two R_x quantile algorithms are very popular, which used in Spark, PostgreSQL and Excel, etc. Reference: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html

What alternatives have you considered? No, due to percentiles in legacy SQL engine couldn't work in PPL.

Do you have any additional context? Add any other context or screenshots about the feature request here.

LantaoJin commented 1 month ago

PoC: https://github.com/LantaoJin/search-plugins-sql/tree/poc/percentile

LantaoJin commented 1 month ago

cc @dai-chen, @penghuo

penghuo commented 1 month ago

Option 2 (more readable and widely used in OLAP engines)

I perfer option2

To align with the current percentiles(aggField) implementation in legacy engine and percentiles agg in OpenSearch core, for percentile we will use t-digest construction algorithm which is an approximate calculation.

Do u proposal add another aggregator? Currently, SQL aggregation framework executed query plan on coordination node (if it can not be push down to OpenSearch). In future, we want to leverage Spark aggregation framework, instead of re-invent on it.

For percentile_cont and percentile_disc, we could use org.apache.commons.math3:Percentile instead. For example, setting EstimationType R_1 for percentile_disc and R_7 for percentile_cont. This two R_x quantile algorithms are very popular, which used in Spark, PostgreSQL and Excel, etc. Reference: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html

Is it possible to implement it in OpenSearch Core? and SQL/PPL can leverage it.