mkindahl opened this issue 3 years ago
Thank you for opening this issue @mkindahl!
I'm not familiar with ARIMA models, so a couple of questions on how this works operationally. For example, what would a full call to
arima_detect(...)
be? This is jumping the gun a bit, but the reason I ask is that, looking at your example, it feels like the parameters might be a natural fit for continuous aggregates, if their computation can be divvied up appropriately, and I wonder if we could create a similar continuous aggregate or view to do your search suggestion...
I'm not sure how useful this is, but madlib has an existing implementation of ARIMA for reference:
https://github.com/apache/madlib/src/ports/postgres/modules/tsa/arima.sql_in
But yes, continuous aggregates would be a great fit for this.
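To make that concrete, here is a minimal sketch of what the continuous-aggregate route could look like, assuming a hypothetical arima_train aggregate (nothing with that name exists today) that can be partially aggregated the way continuous aggregates require:

```sql
-- Hypothetical sketch: maintain per-device ARIMA parameters as a
-- continuous aggregate. arima_train() is an assumed aggregate name;
-- it would need a combine function so buckets can be computed incrementally.
CREATE MATERIALIZED VIEW conditions_arima_params
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', time) AS bucket,
       device_id,
       arima_train(time, temperature) AS params
FROM conditions
GROUP BY bucket, device_id;
```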
Link to the C++ code, if I understand the repo's layout correctly.
(fixed link to the SQL docs https://github.com/apache/madlib/blob/master/src/ports/postgres/modules/tsa/arima.sql_in)
I believe this is the rendered documentation.
And I think this is the training code.
What's the functionality you would like to add?
Analytical functions to support anomaly detection.
Why should this feature be added?
The most common methods for anomaly detection are predictive models based on ARIMA and VAR, and clustering-based methods based on DBSCAN.
ARIMA is a family of methods that try to predict a time series from the historical values of the series itself. In contrast, VAR is a family of methods that try to correlate several time series to see if there is a pattern that can be exploited.
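For reference, the standard ARIMA(p, d, q) model predicts each value of the series (after differencing it d times to obtain y'_t) from p lagged values and q lagged forecast errors:

```latex
y'_t = c + \sum_{i=1}^{p} \phi_i \, y'_{t-i} + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j} + \varepsilon_t
```

The coefficients \phi_i and \theta_j are the parameters that would have to be fitted from historical data and stored.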
Clustering-based methods instead compute clusters of similar values and detect anomalies by identifying points that are far from any of the existing clusters. Although the basic methods are not focused on time series, there are extensions that allow them to be used with time series quite successfully.
Computing Parameters
All of the methods above rely on computing a set of parameters from historical data. The amount of historical data needed depends on the method picked and on the time series itself: in some cases less data works fine, and in some cases more data is required.
This means that it is necessary to have some concept of how much of the (recent) history should be used to compute the parameters, and then a place to store the parameters is needed.
For example, we could compute the parameters using a dedicated aggregation function:
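A hypothetical sketch, where arima_train, the conditions table, and the conditions_params table are all assumed names rather than existing functions or schemas:

```sql
-- Hypothetical: fit ARIMA parameters per device over the last 30 days
-- and store them for later use. arima_train() is an assumed aggregate,
-- not an existing TimescaleDB function.
CREATE TABLE conditions_params (
    device_id  int PRIMARY KEY,
    params     bytea,                       -- opaque, model-specific blob
    trained_at timestamptz DEFAULT now()
);

INSERT INTO conditions_params (device_id, params)
SELECT device_id, arima_train(time, temperature)
FROM conditions
WHERE time > now() - INTERVAL '30 days'
GROUP BY device_id;
```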
And then use it for anomaly detection with something like this:
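Again only a hedged sketch: arima_detect is assumed to take the stored parameters, a timestamp, and a value, and to return true when the value deviates too far from the model's forecast:

```sql
-- Hypothetical: flag recent rows whose value deviates from the forecast.
-- arima_detect() is an assumed function, not an existing one.
SELECT c.time, c.device_id, c.temperature
FROM conditions c
JOIN conditions_params p USING (device_id)
WHERE c.time > now() - INTERVAL '1 day'
  AND arima_detect(p.params, c.time, c.temperature);
```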
Since storage for the parameters is required, it might be possible to use an index access method to make the parameters at each time point available and then use them when searching the table, similar to how Bloom filters are used to filter out a large proportion of rows from a scan.
What scale is this useful at?
Any scale.
Open Questions
It is not clear which of the approaches is best for which situations, so some testing on real time series is necessary.
When "training" the functions by computing parameters, it is necessary to use some sort of training set, which ideally should be the same time series the model is going to be applied to. Once the parameters are computed, they need to be stored somewhere and used for the analysis of the time series.
Since the parameters change over time, a new index access method might be an option: it would allow storing the parameters for each time point, but it is unclear how this would be used when querying the time stream. A good UI needs to be developed for this.
Alternatives
There are some experiments with using deep networks for anomaly detection.