Open LantaoJin opened 1 month ago
cc @dai-chen, @penghuo
Option 2 (more readable and widely used in OLAP engines)
I perfer option2
To align with the current percentiles(aggField) implementation in legacy engine and percentiles agg in OpenSearch core, for percentile we will use t-digest construction algorithm which is an approximate calculation.
Do u proposal add another aggregator? Currently, SQL aggregation framework executed query plan on coordination node (if it can not be push down to OpenSearch). In future, we want to leverage Spark aggregation framework, instead of re-invent on it.
For percentile_cont and percentile_disc, we could use org.apache.commons.math3:Percentile instead. For example, setting EstimationType R_1 for percentile_disc and R_7 for percentile_cont. This two R_x quantile algorithms are very popular, which used in Spark, PostgreSQL and Excel, etc. Reference: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html
Is it possible to implement it in OpenSearch Core? and SQL/PPL can leverage it.
Is your feature request related to a problem? This RFC is a part of this issue: https://github.com/opensearch-project/sql/issues/44 Currently, PPL can not answer question with percentile function. For example
returns
Unsupported aggregation function
SQL query with
percentiles
function works as below because that it fallbacks to legacy engine which is not support PPL, so does current JOIN syntax.PERCENTILE
is a common aggregate function which has a lot of cases needs in Visualization with PPL. Same feature request from community: https://github.com/opendistro-for-elasticsearch/sql/issues/1093What solution would you like?
Syntax in PPL
Option 1 (defined in current
OpenSearchPPLParser.g4
, but not implemented in code)Option 2 (more readable and widely used in OLAP engines)
Syntax in SQL
Basic
ANSI SQL (experimental)
adding OVER ([PARTITION BY expression])
For example, mainstream database supports
percentile_cont
andpercentile_disc
Postgresql https://www.postgresql.org/docs/9.4/functions-aggregate.html RedShift https://docs.aws.amazon.com/redshift/latest/dg/r_PERCENTILE_CONT.html Snowflake https://docs.snowflake.com/en/sql-reference/functions/percentile_disc.html Spark https://issues.apache.org/jira/browse/SPARK-37691Solution
To align with the current
percentiles(aggField)
implementation in legacy engine and percentiles agg in OpenSearch core, forpercentile
we will use t-digest construction algorithm which is an approximate calculation.For
percentile_cont
andpercentile_disc
, we could use org.apache.commons.math3:Percentile instead. For example, setting EstimationType R_1 forpercentile_disc
and R_7 forpercentile_cont
. This two R_x quantile algorithms are very popular, which used in Spark, PostgreSQL and Excel, etc. Reference: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.htmlWhat alternatives have you considered? No, due to
percentiles
in legacy SQL engine couldn't work in PPL.Do you have any additional context? Add any other context or screenshots about the feature request here.