influxdata / flux

Flux is a lightweight scripting language for querying databases (like InfluxDB) and working with data. It's part of InfluxDB 1.7 and 2.0, but can be run independently of those.
https://influxdata.com
MIT License
770 stars 153 forks source link

Feature request: basic statistics #4229

Closed wz2b closed 1 year ago

wz2b commented 2 years ago

I am trying to compute stddev(col) / mean(col) as an aggregation function using aggregateWindow(). I'm finding it's not that easy to do - I basically have to fetch the data, window and aggregate over it twice, then join the data back together. It occurred to me that it might not be a bad idea to have some kind of statistics(col) function that returns not just a single column but a few columns for the basic stats one might want: mean, max, min, stddev, mode, sum, and count - and return them all at once. These things could then be useful for when you need to later combine stats, including situations where you need to do a piecewise or rolling standard deviation. So I thought I would throw this out there as a feature request and see what people think.

wz2b commented 2 years ago

Today, a related topic came up. Someone wanted to compute Confidence Interval, and the ways to do this in flux are a little complicated. This re-enforces the idea that if you could have a statistics(col) function that returns all the basic stats I outlined. I think you could follow that with a map() that turns this into C.I.

nildenist commented 2 years ago

I am trying to compute confidence interval for mean(and maybe for ratio later). In order to calculate statistical confidence intervals there have been a simple formula: x(mean) - 1.96std_dev < x < x(mean) + 1.96std_dev

A little help came from the #flux community and thanks to Anais, we now have a function

data = (from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "SmartSensor_ABS_TR_Kaufman_axialAxisRmsVibration")
  |> filter(fn: (r) => r.name == "Indexer-202")
  |> duplicate(column: "_value", as: "pred")
  |> holtWinters(n: 60, interval: 10s,  column:"pred", withFit: true, seasonality : 8)

mean_val = (data
    |> mean(column: "_value")
    |> findRecord(fn: (key) => true, idx: 0))._value

//Calculate standard deviation from sample
stddev_val_initial = (data
    |> stddev()
    |> findRecord(fn: (key) => true, idx: 0))._value

stddev_val = math.abs(stddev_val_initial)*1.96

data
|> map(fn: (r) => ({r with
upper: mean_val + stddev_val,
lower: mean_val - stddev_val}))
|> window(
  every: 30s,
  period: 1m,
  offset: -5m,
  timeColumn: "_time",
  startColumn: "_start",
  stopColumn: "_stop",
  createEmpty: false)

The problem is here I use InfluxDB OSS 1.8.9 and my Flux version I guess 0.65. But findRecord is not supported on Flux 0.65.

Mainly I have hard times when I read all the documentation on Flux because we could not assign some basic statistics methos to a variable and could not gather them in the same equation.

Some of the features in the #flux that I need as in the following:

Hotelling T2 Metric is especially crucial for get a valuable insights for clusters. By this way #Flux will have to ability to cluster, compare them and the wonderful thing is I can do it in a stream way on chronograf without turn my face to batch Python airflow tasks.

And finally linear regression is the work-horce of all statistics. Including regression function on flux brings so much valuable insights on data and I can create wonderful products by just using it as a data engineer-statistician.

I create wonderful jobs by using TICK stack and I believe if statistics functions on flux were available, data science manner were changed.

github-actions[bot] commented 1 year ago

This issue has had no recent activity and will be closed soon.