opensearch-project / sql

Query your data using familiar SQL or intuitive Piped Processing Language (PPL)
https://opensearch.org/docs/latest/search-plugins/sql/index/
Apache License 2.0
117 stars 134 forks source link

[FEATURE]New `fieldsummary` PPL command #3026

Open YANG-DB opened 2 weeks ago

YANG-DB commented 2 weeks ago

Describe the solution you'd like We propose adding a new fieldsummary command to OpenSearch PPL that would provide summary statistics for all fields in the current result set.

This command should:

  1. Calculate basic statistics for each field (count, distinct count, min, max, avg for numeric fields)
  2. Determine the data type of each field
  3. Show the most frequent values and their counts for each field
  4. Calculate the percentage of events that contain each field

Additionally, the command should support the following key optional parameters:

  1. includefields: Specify which fields to include in the summary (e.g., | fieldsummary includefields="status_code,user_id,response_time")
  2. excludefields: Specify which fields to exclude from the summary (e.g., | fieldsummary excludefields="internal_id,debug_info")
  3. topvalues: Set the number of top values to display for each field (e.g., | fieldsummary topvalues=5)
  4. maxfields: Limit the number of fields to display (e.g., | fieldsummary maxfields=20)
  5. nulls: Include null/empty value counts (e.g., | fieldsummary nulls=true)

Example usage:

source = t
| where timestamp >= "2023-01-01" and timestamp < "2023-02-01"
| fieldsummary includefields="status_code,user_id,response_time" topvalues=3 nulls=true

This command would generate a table with summary statistics for the specified fields in the given date range, showing the top 3 values for each field and including null counts.

Example output:

Field Count Distinct Min Max Avg Type Top Values Nulls
status_code 10000 4 200 503 - short 200 (8000, 80%)
404 (1500, 15%)
500 (400, 4%)
0
user_id 9500 1200 - - - string user123 (100, 1%)
user456 (95, 1%)
user789 (90, 0.9%)
500
response_time 10000 986 0.01 10.5 0.75 float 0.5 (2000, 20%)
0.75 (1800, 18%)
1.0 (1500, 15%)
0