opensearch-project / sql

Query your data using familiar SQL or intuitive Piped Processing Language (PPL)
https://opensearch.org/docs/latest/search-plugins/sql/index/
Apache License 2.0
117 stars 134 forks source link

[FEATURE]Extend ppl `stats` command functionality #3024

Open YANG-DB opened 2 weeks ago

YANG-DB commented 2 weeks ago

High level Review

The OpenSearch Piped Processing Language (PPL) currently lacks some advanced statistical aggregation capabilities similar to those provided by the eventstats command in Splunk Search Processing Language (SPL). This feature request proposes adding new functions and syntax to PPL to enable statistical calculations and aggregations on event data.

Proposed Functionality:

  1. Aggregate statistical calculations:

    • Calculate common statistical measures like sum, count, min, max, avg, etc., on specific fields or expressions.
    • Support grouping events by one or more fields and performing statistical calculations within each group.
    • Allow renaming the calculated fields with custom names.
  2. Conditional aggregations:

    • Perform statistical calculations based on conditional expressions or filters.
    • Evaluate conditional expressions for each event and aggregate the results (e.g., sum of a conditional expression).
  3. Chaining and nesting:

    • Enable chaining and nesting of statistical calculations, similar to how eventstats commands can be chained in SPL.
    • Allow performing multiple levels of aggregations and calculations in a single query.
  4. Integration with existing PPL syntax:

    • Seamlessly integrate the new statistical aggregation capabilities with the existing PPL syntax and functions.
    • Ensure compatibility with other PPL features and maintain the overall usability and readability of the language.

Examples:

  1. Calculate the sum of a conditional expression grouped by a field:
stats sum(if(field1 = "value" and field2 like "%pattern%", 1, 0)) as conditional_sum by group_field
  1. Calculate minimum and maximum values of a field grouped by another field:
stats min(latency_field) as min_latency, max(latency_field) as max_latency by operation_id
  1. Chain multiple statistical calculations:
stats sum(count) as total_count by client_id | stats sum(total_count) as overall_total