opensearch-project / sql

Query your data using familiar SQL or intuitive Piped Processing Language (PPL)
https://opensearch.org/docs/latest/search-plugins/sql/index/
Apache License 2.0
120 stars 139 forks source link

[META]PPL new `trendline` command #3011

Open YANG-DB opened 1 month ago

YANG-DB commented 1 month ago

Is your feature request related to a problem?

Adding a new PPL trendline command to support computing a moving averages of fields.

We would like to support two flavours of moving average:

**SMA : Simple moving average**

SMA(t) = (1/n) * Σ(f[i]), where i = t-n+1 to t


**WMA : Weighted moving average**

WMA(t) = Σ(w[i] * f[i]) / Σ(w[i]), where i = t-n+1 to t Where w[i] is the weight for the i-th data-point.

In a typical WMA, the weights are linearly decreasing from the most recent to the oldest data-point: w[i] = n - (t - i), where i = t-n+1 to t

The complete forumlation would be: WMA(t) = Σ((n - (t - i)) * f[i]) / Σ(n - (t - i)), where i = t-n+1 to t


Example

The next command shows a trendline over a 5 month period events by month

source=t | stats count(date_month) | trendline sma(5, count) AS trend | fields  trend

The next command would compute a 5-point simple moving average of the 'cpu_usage' field and store it in a new field called 'smooth_cpu'.

source=t| trendline sma(5,cpu_usage) as smooth_cpu

Multiple trendlines could be calculated in a single command, such as

| trendline sma(10,memory) as mem_trend wma(5,network_traffic) as net_trend.

Support for PPL trendline functionality is required for both:

- OpenSearch based PPL engine

Do you have any additional context? Add any other context or screenshots about the feature request here.

LantaoJin commented 1 month ago

Hi @YANG-DB , could you provide any background about distinguishing between trendline and stats? Based on the examples above, I don't get the key of difference between them.

YANG-DB commented 1 month ago

Hi 'stats' command only supports standard average without the sliding average window, we could add that support to 'stats' command but I think that separating the sliding window average from standard stats will simply the actual usage. In addition such unique commands for trends exist in other prominent pipeline languages

LantaoJin commented 1 month ago

My concern is there could be more window function related requests in PPL. I prefer to add a new command to support all of them instead of introducing specific smb and wma functions. My thoughts is adding a fundamental syntax for common propose which similar to streamstats. The example "compute a 5-point simple moving average of the 'cpu_usage' field and store it in a new field called 'smooth_cpu'." could be written to

| streamstats window=5 current=false global=false avg(cpu_usage) as smooth_cpu

or named trendstats

jduo commented 1 month ago

Hello, I've started the trendline PPL implementation. I've ported the lexer and parser changes over from the Spark PR.

Implementation-wise I think this should be very similar to how average is implemented, except the AggregationState should factor in the window specification. Does this seem like the right way to go?

One difference between this and average is that this is a root level command rather than a sub-command of stats

jduo commented 4 weeks ago

Also, noticed that the syntax suggested above differs from the SPL definition of trendline: https://docs.splunk.com/Documentation/SplunkCloud/latest/SearchReference/Trendline

Notably, in SPL, the period argument is sort of embedded into moving average type instead of being within the parentheses of the moving average type

YANG-DB commented 4 weeks ago

@jduo I would like to rethink about this

streamstats

streamstats is another function we can implement in a more generic manner - but still I see value to the trendline implementation in its own - @penghuo @ykmr1224 @dai-chen what do you think ?

YANG-DB commented 4 weeks ago

Also, noticed that the syntax suggested above differs from the SPL definition of trendline: https://docs.splunk.com/Documentation/SplunkCloud/latest/SearchReference/Trendline

Notably, in SPL, the period argument is sort of embedded into moving average type instead of being within the parentheses of the moving average type

@jduo thanks I'd like to reduce the complexity for the first iteration - it does resemble the stats command - maybe we can think of extending stats instead ? let me know what you think and if you have any specific suggestion here ?

jduo commented 4 weeks ago

Also, noticed that the syntax suggested above differs from the SPL definition of trendline: https://docs.splunk.com/Documentation/SplunkCloud/latest/SearchReference/Trendline Notably, in SPL, the period argument is sort of embedded into moving average type instead of being within the parentheses of the moving average type

@jduo thanks I'd like to reduce the complexity for the first iteration - it does resemble the stats command - maybe we can think of extending stats instead ? let me know what you think and if you have any specific suggestion here ?

@YANG-DB I made some good progress on the window-function like version of trendline yesterday before I saw this comment. I have parsing and logical planning working right. I'm going to continue down this path today to see if I can get execution working. I can put up a PR if you'd like to see or wait until execution works.

YANG-DB commented 4 weeks ago

@jduo thanks for the update Please continue and create the PR in draft mode

jduo commented 3 weeks ago

Should the alias part of the trendline command be mandatory? As a user I would expect its optional and to use the original field name if omitted. In SPL it is optional.