Feature request: basic statistics

wz2b commented 2 years ago

I am trying to compute stddev(col) / mean(col) as an aggregation function using aggregateWindow(). It's not that easy to do: I basically have to fetch the data, window and aggregate over it twice, then join the results back together. It occurred to me that it might be a good idea to have some kind of statistics(col) function that returns not just a single column but several columns with the basic stats one might want (mean, max, min, stddev, mode, sum, and count), all at once. These could then be useful when you later need to combine stats, including situations where you need a piecewise or rolling standard deviation. So I thought I would throw this out there as a feature request and see what people think.
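As a possible workaround until something like statistics(col) exists, the ratio can be computed in a single pass with reduce(), which avoids the second aggregation and the join. This is only a sketch: the bucket, measurement, and field names are placeholders, it assumes _value is a float, and the sum-of-squares form of the variance can lose precision on data with a large mean.

```flux
import "math"

// Placeholder bucket, measurement, and field names (not from this thread).
from(bucket: "example-bucket")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "example_measurement" and r._field == "value")
    |> window(every: 1m)
    // One pass per window: accumulate count, sum, and sum of squares.
    |> reduce(
        identity: {count: 0.0, sum: 0.0, sumsq: 0.0},
        fn: (r, accumulator) => ({
            count: accumulator.count + 1.0,
            sum: accumulator.sum + r._value,
            sumsq: accumulator.sumsq + r._value * r._value
        })
    )
    // Derive mean and sample standard deviation (n - 1 denominator).
    |> map(fn: (r) => ({r with
        mean: r.sum / r.count,
        stddev: if r.count > 1.0 then math.sqrt(x: (r.sumsq - r.sum * r.sum / r.count) / (r.count - 1.0)) else 0.0
    }))
    // Coefficient of variation: stddev / mean.
    |> map(fn: (r) => ({r with cv: r.stddev / r.mean}))
```

Each output row carries the window's _start and _stop plus count, sum, mean, stddev, and cv, so later steps can combine them as needed.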

wz2b commented 2 years ago

Today, a related topic came up. Someone wanted to compute a confidence interval, and the ways to do this in Flux are a little complicated. This reinforces the case for a statistics(col) function that returns all the basic stats I outlined: you could follow it with a map() that turns them into a confidence interval.

nildenist commented 2 years ago

I am trying to compute a confidence interval for the mean (and maybe for a ratio later). There is a simple formula for calculating statistical confidence intervals: mean - 1.96 * std_dev < x < mean + 1.96 * std_dev
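For reference, a band of 1.96 standard deviations around the mean covers roughly 95% of individual observations when the data are approximately normal, whereas a 95% confidence interval for the mean itself divides the standard deviation by the square root of the sample size. The symbols below (sample mean, sample standard deviation s, sample size n, true mean) are introduced here only for illustration:

```latex
\begin{align*}
\bar{x} - 1.96\,s &< x < \bar{x} + 1.96\,s
  && \text{(range for individual values)} \\
\bar{x} - 1.96\,\frac{s}{\sqrt{n}} &< \mu < \bar{x} + 1.96\,\frac{s}{\sqrt{n}}
  && \text{(confidence interval for the mean)}
\end{align*}
```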

With a little help from the #flux community, and thanks to Anais, we now have the following query:

import "math"

data = from(bucket: "telegraf/autogen")
    |> range(start: v.timeRangeStart)
    |> filter(fn: (r) => r._measurement == "SmartSensor_ABS_TR_Kaufman_axialAxisRmsVibration")
    |> filter(fn: (r) => r.name == "Indexer-202")
    |> duplicate(column: "_value", as: "pred")
    |> holtWinters(n: 60, interval: 10s, column: "pred", withFit: true, seasonality: 8)

// Mean of the sample, extracted as a scalar
mean_val = (data
    |> mean(column: "_value")
    |> findRecord(fn: (key) => true, idx: 0))._value

// Standard deviation of the sample, extracted as a scalar
stddev_val_initial = (data
    |> stddev()
    |> findRecord(fn: (key) => true, idx: 0))._value

// Half-width of the band: 1.96 standard deviations
stddev_val = math.abs(x: stddev_val_initial) * 1.96

// Attach the upper and lower bounds to every row, then window the result
data
    |> map(fn: (r) => ({r with
        upper: mean_val + stddev_val,
        lower: mean_val - stddev_val}))
    |> window(
        every: 30s,
        period: 1m,
        offset: -5m,
        timeColumn: "_time",
        startColumn: "_start",
        stopColumn: "_stop",
        createEmpty: false)

The problem is that I use InfluxDB OSS 1.8.9, and my Flux version is, I guess, 0.65. findRecord is not supported on Flux 0.65.
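One thing that may be worth trying on older Flux versions is the tableFind() and getRecord() pair, which predates findRecord() in the standard library. Whether a given 0.65 build includes both is an assumption that should be checked; the source query below is a shortened placeholder for the data stream from the query above.

```flux
// Shortened placeholder for the `data` stream defined in the query above.
data = from(bucket: "telegraf/autogen")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "SmartSensor_ABS_TR_Kaufman_axialAxisRmsVibration")

// tableFind() picks the first table; getRecord() pulls row 0 out as a record.
mean_val = (data
    |> mean(column: "_value")
    |> tableFind(fn: (key) => true)
    |> getRecord(idx: 0))._value

// Use the scalar like any other value, e.g. attach it to each row.
data
    |> map(fn: (r) => ({r with mean: mean_val}))
```

The same pattern would apply to stddev() in place of mean().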

My main difficulty when reading the Flux documentation is that we cannot assign the result of a basic statistics method to a variable, and so cannot combine several of them in the same equation.

Some of the features I need in #flux are the following:

Confidence Interval
Correlation (I could only find covariance, and was very surprised not to see correlation alongside it)
Huber M statistics (a robust estimator from the M-estimator family that works well on almost all real-world data)
Phi coefficient (for nominal-scaled variables, so that tags can be analyzed properly)
Contingency correlation (Pearson's)
Cramér's V (for non-square pivots)
Eta coefficient (to find relationships between tags and fields in a stream)
Cohen's kappa statistic (to calculate the accuracy of our predictions)
Distance metrics (#flux may not be able to do k-means yet, but if distance metrics were available I could easily do some clustering analysis, close to ML). These distance metrics are: Euclidean distance, Chebyshev distance, Manhattan (city-block) distance, Pearson correlation coefficient, Jaccard distance, and Hotelling's T2 metric.

Hotelling's T2 metric is especially crucial for getting valuable insights into clusters. With it, #flux would be able to cluster data and compare the clusters, and the wonderful thing is that I could do it in a streaming fashion in Chronograf without turning to batch Python Airflow tasks.
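On the correlation item in the list above: Flux's covariance() accepts a pearsonr parameter (and there is also a standalone pearsonr() function), so a Pearson correlation coefficient may already be within reach. A minimal sketch, in which the bucket, the measurement, and the field names x and y are placeholders, and behaviour on Flux 0.65 has not been verified here:

```flux
// Placeholder bucket, measurement, and field names (not from this thread).
from(bucket: "example-bucket")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "example_measurement" and (r._field == "x" or r._field == "y"))
    // Put the two fields side by side as columns "x" and "y".
    |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
    // pearsonr: true normalizes the covariance to the Pearson R coefficient.
    |> covariance(columns: ["x", "y"], pearsonr: true)
```

The result is one row per input table with the coefficient in _value.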

And finally, linear regression is the workhorse of all statistics. Including a regression function in Flux would bring many valuable insights into the data, and I could create wonderful products just by using it as a data engineer and statistician.

I do wonderful work with the TICK stack, and I believe that if statistics functions were available in Flux, the way data science is done would change.

github-actions[bot] commented 1 year ago

This issue has had no recent activity and will be closed soon.

influxdata / flux

Feature request: basic statistics #4229