Describe the solution you'd like
We propose adding a new fieldsummary command to OpenSearch PPL that would provide summary statistics for all fields in the current result set.
This command should:
Calculate basic statistics for each field (count and distinct count for every field; min, max, and avg for numeric fields)
Determine the data type of each field
Show the most frequent values and their counts for each field
Calculate the percentage of events that contain each field
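To make the intended statistics concrete, here is a minimal Python sketch of the per-field computation. It is only an illustration of the semantics, not a proposed implementation: `summarize_field`, the dict-shaped events, and the output key names are all hypothetical, and type detection is reduced to reporting the Python types observed.

```python
from collections import Counter
from numbers import Number


def summarize_field(rows, field, top_n=3):
    """Summarize one field across a list of dict-shaped events.

    Hypothetical helper: assumes scalar (hashable) field values.
    """
    # Keep only events where the field is present and non-null.
    values = [row[field] for row in rows if row.get(field) is not None]
    summary = {
        "field": field,
        "count": len(values),
        "distinct_count": len(set(values)),
        # Percentage of events that contain the field.
        "coverage_pct": 100.0 * len(values) / len(rows) if rows else 0.0,
        # Most frequent values with their counts.
        "top_values": Counter(values).most_common(top_n),
        # Naive type detection: names of the Python types observed.
        "types": sorted({type(v).__name__ for v in values}),
    }
    numeric = [v for v in values
               if isinstance(v, Number) and not isinstance(v, bool)]
    # min/max/avg are reported only when every value is numeric.
    if values and len(numeric) == len(values):
        summary["min"] = min(numeric)
        summary["max"] = max(numeric)
        summary["avg"] = sum(numeric) / len(numeric)
    return summary
```

For example, `summarize_field(rows, "status_code")` over four events, three of which carry a status code, would report a count of 3 and a coverage of 75%.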
Additionally, the command should support the following key optional parameters:
includefields:
Specify which fields to include in the summary (e.g., | fieldsummary includefields="status_code,user_id,response_time")
excludefields:
Specify which fields to exclude from the summary (e.g., | fieldsummary excludefields="internal_id,debug_info")
topvalues:
Set the number of top values to display for each field (e.g., | fieldsummary topvalues=5)
maxfields:
Limit the number of fields to display (e.g., | fieldsummary maxfields=20)
nulls:
Include null/empty value counts (e.g., | fieldsummary nulls=true)
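The interaction between the field-selection parameters could look like the Python sketch below. The order of precedence (apply includefields first, then excludefields, then the maxfields cap) is an assumption made for illustration; the proposal does not specify it. `select_fields` is a hypothetical helper name.

```python
def select_fields(all_fields, includefields=None, excludefields=None,
                  maxfields=None):
    """Pick which fields to summarize (hypothetical helper).

    Assumed precedence: include, then exclude, then the maxfields cap.
    """
    def parse(spec):
        # Split a comma-separated list such as "status_code,user_id".
        return [name.strip() for name in spec.split(",") if name.strip()]

    fields = list(all_fields)
    if includefields:
        wanted = set(parse(includefields))
        # Keep the result set's own field order.
        fields = [f for f in fields if f in wanted]
    if excludefields:
        dropped = set(parse(excludefields))
        fields = [f for f in fields if f not in dropped]
    if maxfields is not None:
        fields = fields[:maxfields]
    return fields
```

With this ordering, a field named in both lists is excluded, which seems like the least surprising behavior but is open to discussion.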
Example usage:
source = t
| where timestamp >= "2023-01-01" and timestamp < "2023-02-01"
| fieldsummary includefields="status_code,user_id,response_time" topvalues=3 nulls=true
This command would generate a table with summary statistics for the specified fields in the given date range, showing the top 3 values for each field and including null counts.
Example output:
status_code: 404 (1500, 15%), 500 (400, 4%)
user_id: user456 (95, 1%), user789 (90, 0.9%)
response_time: 0.75 (1800, 18%), 1.0 (1500, 15%)
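The top-value entries above follow a "value (count, percentage)" pattern. A minimal Python sketch of how such entries might be rendered is shown below. `format_top_values` is a hypothetical helper, and rounding to one decimal place is an assumption, since the proposal does not define a rounding rule.

```python
from collections import Counter


def format_top_values(values, top_n=3):
    """Render the most frequent values as 'value (count, pct%)' strings.

    Hypothetical helper; percentages are relative to len(values) and
    rounded to one decimal place (an assumed rule).
    """
    total = len(values)
    entries = []
    for value, count in Counter(values).most_common(top_n):
        pct = round(100.0 * count / total, 1)
        # ':g' drops a trailing '.0', so 15.0 is printed as 15.
        entries.append(f"{value} ({count}, {pct:g}%)")
    return entries


# Illustrative data only: a made-up set of 10,000 events.
status_codes = [200] * 8100 + [404] * 1500 + [500] * 400
print(format_top_values(status_codes))
```

The 10,000-event total and the value 200 are invented for this illustration. The percentage denominator here is the number of non-null values, which is another choice the final design would need to settle.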