timescale / timescaledb-toolkit

Extension for more hyperfunctions, fully compatible with TimescaleDB and PostgreSQL 📈
https://www.timescale.com

Analytical functions for Anomaly Detection #45

Open mkindahl opened 3 years ago

mkindahl commented 3 years ago

What's the functionality you would like to add

Analytical functions to support anomaly detection.

Why should this feature be added?

The most common methods for anomaly detection are predictive models based on ARIMA and VAR, and clustering-based methods such as DBSCAN.

ARIMA (autoregressive integrated moving average) is a family of methods that try to predict a time series from its own historical values. In contrast, VAR (vector autoregression) is a family of methods that try to correlate several time series to see if there is a pattern that can be exploited.
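To make the prediction-based idea concrete, here is a minimal sketch using an AR(1) model (a degenerate member of the ARIMA family): fit the model on history, then flag points whose prediction residual exceeds a threshold. All function names and data here are illustrative, not part of the toolkit.

```python
# Prediction-based anomaly detection sketch: fit x[t] = phi * x[t-1] + c
# by least squares on historical data, then flag points whose residual
# exceeds a threshold. Illustrative only; real ARIMA models have more
# parameters and are fit with more robust estimators.

def fit_ar1(series):
    """Least-squares estimate of (phi, c) for x[t] = phi * x[t-1] + c."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    phi = cov / var
    return phi, my - phi * mx

def detect_ar1(series, phi, c, threshold):
    """Indices t where |x[t] - (phi * x[t-1] + c)| > threshold."""
    return [t for t in range(1, len(series))
            if abs(series[t] - (phi * series[t - 1] + c)) > threshold]

history = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.1]
phi, c = fit_ar1(history)
live = [1.0, 1.02, 5.0, 1.01]   # 5.0 is an obvious outlier
print(detect_ar1(live, phi, c, 2.5))   # → [2]
```

The threshold would in practice be derived from the residual variance on the training data rather than hard-coded.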

Clustering-based methods, on the other hand, compute clusters of similar values and detect anomalies by identifying points that are far from any existing cluster. Although the basic methods are not designed for time series, there are extensions that allow them to be used with time series quite successfully.
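The clustering-based idea can be sketched in DBSCAN terms: a point with fewer than min_pts neighbors within distance eps is "noise", and noise points are the anomaly candidates. This toy version works on plain 1-D values; the function name and data are illustrative, and time-series extensions would typically embed windows of values instead of single points.

```python
# DBSCAN-flavored anomaly detection sketch: flag points that do not have
# enough close neighbors to belong to any dense cluster ("noise" points).
# O(n^2) brute force for clarity; real DBSCAN uses spatial indexes.

def dbscan_noise(values, eps, min_pts):
    """Indices of points with fewer than min_pts neighbors within eps."""
    noise = []
    for i, v in enumerate(values):
        # Neighbor count includes the point itself, as in standard DBSCAN.
        neighbors = sum(1 for w in values if abs(v - w) <= eps)
        if neighbors < min_pts:
            noise.append(i)
    return noise

values = [1.0, 1.1, 0.9, 1.05, 0.95, 9.0, 1.0]
print(dbscan_noise(values, eps=0.3, min_pts=3))   # → [5], the 9.0
```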

Computing Parameters

All of the methods above rely on computing a set of parameters from historical data. The amount of history needed depends on the method and the time series: in some cases less data works fine, and in other cases more data is required.

This means that we need some notion of how much of the (recent) history should be used to compute the parameters, as well as a place to store the parameters.

For example, we could compute the parameters using a dedicated aggregation function:

    SELECT device_id, arima_compute("time", cpu, disk)
    FROM device_data
    GROUP BY device_id;

And then use them for anomaly detection with something like this:

    SELECT device_id, arima_detect(params, "time", cpu, disk)
    FROM device_data
    GROUP BY device_id;
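Operationally, the two queries above imply a fit-then-apply split per device: the first pass produces one parameter set per group, and the second pass scores each group's rows against its stored parameters. A rough Python analogue of that flow, with a trivial mean/stddev z-score model standing in for the real ARIMA parameters (all names hypothetical):

```python
# Two-phase flow mirroring the SQL sketches: compute_params plays the
# role of arima_compute (one parameter set per device), detect plays the
# role of arima_detect (score rows against the stored parameters).
from collections import defaultdict
from statistics import mean, stdev

def compute_params(rows):
    """rows: iterable of (device_id, value) -> {device_id: (mean, stddev)}."""
    by_device = defaultdict(list)
    for device_id, value in rows:
        by_device[device_id].append(value)
    return {d: (mean(vs), stdev(vs)) for d, vs in by_device.items()}

def detect(rows, params, z_threshold=3.0):
    """Yield (device_id, value) pairs whose z-score exceeds the threshold."""
    for device_id, value in rows:
        mu, sigma = params[device_id]
        if sigma > 0 and abs(value - mu) / sigma > z_threshold:
            yield device_id, value

history = [("dev1", v) for v in (1.0, 1.1, 0.9, 1.0, 1.05, 0.95)]
params = compute_params(history)    # analogous to arima_compute
live = [("dev1", 1.02), ("dev1", 9.0)]
print(list(detect(live, params)))   # → [('dev1', 9.0)]
```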

Since storage for the parameters is required, it might be possible to use the index access method interface to make the parameters at each point in time available, and then allow them to be used when searching the table, similar to how Bloom filters are used to filter out a large proportion of rows from a scan.

What scale is this useful at?

Any scale.

Open Questions

It is not clear which of the approaches is best in which situation, so some testing on real time series is necessary.

When "training" the functions by computing parameters, some sort of training set is needed, ideally drawn from the very time series the model will later be applied to. Once the parameters are computed, they need to be stored somewhere and used for the analysis of the time series.

Since the parameters change over time, a new index access method might be usable here: it would allow storing the parameters for each point in time, but it is unclear how this would be used when querying the time series. A good user interface needs to be designed for this.

Alternatives

There are some experiments with using deep neural networks for anomaly detection.

References

JLockerman commented 3 years ago

Thank you for opening this issue @mkindahl!

I'm not familiar with ARIMA models, so a couple of questions on how this works operationally.

  1. How often do the parameters change? Is this the kind of thing where you can compute the parameters early in the dataset's lifetime and keep using them for a long time thereafter, or the kind of thing where you're going to be frequently recalculating them?
  2. How large are the parameters? (I ask because, as you are no doubt aware, that directly determines the storage needed to store many parameter sets.) A quick google suggests that it's around 3 values, so approximately 24 bytes per set; does that sound right?
  3. How expensive do you think arima_detect(...) would be?

This is jumping the gun a bit, but the reason I ask is that, looking at your example, it feels like the parameters might be a natural fit for continuous aggregates, if their computation can be divvied up appropriately, and I wonder if we could create a similar continuous aggregate or view to implement your search suggestion...

samgaw commented 3 years ago

I'm not sure how useful this is, but MADlib has an existing implementation of ARIMA for reference.

https://github.com/apache/madlib/src/ports/postgres/modules/tsa/arima.sql_in

But yes, continuous aggregates would be a great fit for this.

JLockerman commented 3 years ago

link to the C++ code if I understand the repo's layout correctly

JLockerman commented 3 years ago

(fixed link to the SQL docs https://github.com/apache/madlib/blob/master/src/ports/postgres/modules/tsa/arima.sql_in)

JLockerman commented 3 years ago

I believe this is the rendered documentation.

JLockerman commented 3 years ago

and I think this is the training code