twitter / AnomalyDetection

Anomaly Detection with R
GNU General Public License v3.0

Suggestion: Identify and Remove Linear Trend Along with Seasonal Component #43

Open mmolaro opened 9 years ago

mmolaro commented 9 years ago

The generalized ESD method normalizes deviation from the mean based on an estimate of the population variance. If the data has an appreciable, uncompensated linear trend, the trend inflates that estimate. This is equivalent to assuming the noise in the data is much higher than the true noise in the signal, so many outlying data points will no longer be flagged.
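A minimal sketch of this effect, using synthetic data and the classic mean/standard-deviation form of the ESD statistic (both are assumptions for illustration, not the package's code). It compares the statistic for one injected outlier with no trend, with a linear trend, and after removing the trend with an ordinary least-squares line:

```r
# Sketch: an unremoved linear trend inflates the scale estimate that an
# ESD-style statistic divides by. All data here is synthetic (assumption).
set.seed(1)
n <- 1000
idx <- seq_len(n)

noise <- rnorm(n, mean = 0, sd = 1)
noise[500] <- 6                      # one injected outlier, about 6 noise sigmas

trended <- noise + 0.01 * idx        # same series plus a linear trend

# ESD-style statistic for the most extreme point: max |x - mean(x)| / sd(x)
esd_stat <- function(x) max(abs(x - mean(x))) / sd(x)

# Remove the trend with a least-squares line (illustrative choice only)
detrended <- residuals(lm(trended ~ idx))

stat_flat      <- esd_stat(noise)
stat_trended   <- esd_stat(trended)
stat_detrended <- esd_stat(detrended)

print(c(flat = stat_flat, trended = stat_trended, detrended = stat_detrended))
print(c(sd_flat = sd(noise), sd_trended = sd(trended)))
```

In this setup the statistic for the trended series is much smaller than for the flat or detrended series, because the trend dominates the standard deviation.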

This package uses stl from the R stats library to remove the seasonal component, and it identifies the trend in the data, but it doesn't remove the trend before doing the ESD analysis. My suggestion is to just use the remainder column of data_decomp for the ESD analysis (optionally subtracting the median).

From https://github.com/twitter/AnomalyDetection/blob/master/R/detect_anoms.R

    # -- Step 1: Decompose data. This returns a univariate remainder which will be used for anomaly detection. Optionally, we might NOT decompose.
    data_decomp <- stl(ts(data[[2L]], frequency = num_obs_per_period),
                       s.window = "periodic", robust = TRUE)

    # Remove the seasonal component, and the median of the data to create the univariate remainder
    data <- data.frame(timestamp = data[[1L]], count = (data[[2L]]-data_decomp$time.series[,"seasonal"]-median(data[[2L]])))
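The difference between the quoted approach and the suggested one can be sketched on synthetic data. The series, the variable names, and the period of 24 below are assumptions for illustration; only the `stl` call and the `"seasonal"` and `"remainder"` columns come from the R stats library:

```r
# Synthetic series: seasonality + linear trend + noise (assumption)
set.seed(42)
period <- 24
n <- period * 30
idx <- seq_len(n)
x <- 10 * sin(2 * pi * idx / period) + 0.05 * idx + rnorm(n)

decomp <- stl(ts(x, frequency = period), s.window = "periodic", robust = TRUE)

# Approach in the quoted snippet: remove the seasonal component and the median only
resid_current <- as.numeric(x - decomp$time.series[, "seasonal"] - median(x))

# Suggested approach: use the STL remainder (seasonal and trend both removed),
# optionally recentred on its median
remainder <- as.numeric(decomp$time.series[, "remainder"])
resid_suggested <- remainder - median(remainder)

sd_current   <- sd(resid_current)
sd_suggested <- sd(resid_suggested)
print(c(current = sd_current, suggested = sd_suggested))
```

With a trend present, the spread of `resid_current` is dominated by the trend, while `resid_suggested` stays close to the noise level.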

Here is a trivial example of the kind of issue this can cause.

Run the example:

    AnomalyDetectionVec(raw_data[,2], max_anoms=0.02, period=1440, direction='both', plot=TRUE)

[image: rplot1]

Add a linear trend and run again:

    new_data = raw_data + 0.01*(1:14398)
    AnomalyDetectionVec(new_data[,2], max_anoms=0.02, period=1440, direction='both', plot=TRUE)

[image: rplot2]

owenvallis commented 9 years ago

Thanks for the suggestion. We had actually run into this, and we found that extreme anomalies can severely distort the derived trend component from STL. This meant we couldn't trust the derived trend to be free from the influence of extreme anomalies. We opted to replace the STL trend with the median, our assumption being that small enough windows of data would look flat. This removed the distorted trend, but to your point, it left us unable to track trends in the data.

Our solution for including the trend was to create a piecewise approach where we slice the data into small windows and then assume the trend to be flat within each window. This creates a sliding context from which to derive the anomalies. You can turn it on by enabling the longterm parameter. We also looked at using a B-spline to derive the trend. This yielded comparable results to the piecewise approach, but performance slowed down with longer time series.
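The piecewise idea described above can be sketched as follows. This is not the package's implementation: `piecewise_median_detrend` and `window_size` are hypothetical names, the windows here are fixed and non-overlapping, and the data is synthetic.

```r
# Hypothetical helper (not the package's code): treat the trend as flat inside
# each fixed-size window and subtract that window's median.
piecewise_median_detrend <- function(x, window_size) {
  stopifnot(window_size >= 1)
  # Window id for every observation: 1, 1, ..., 2, 2, ...
  window_id <- ceiling(seq_along(x) / window_size)
  # ave() computes the median within each window and returns a vector aligned with x
  x - ave(x, window_id, FUN = median)
}

# Synthetic series with a linear trend plus noise (assumption)
set.seed(7)
x <- 0.02 * seq_len(2000) + rnorm(2000)
flat <- piecewise_median_detrend(x, window_size = 200)

print(c(sd_raw = sd(x), sd_piecewise = sd(flat)))
```

Each window is recentred on zero, so the overall trend no longer dominates the spread, although any slope inside a window is left in place.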

mmolaro commented 9 years ago

Thanks for the quick reply. I appreciate the strategy of using the longterm parameter, but I am not really satisfied with a piecewise constant model for trends. Can you post an example data series where STL gives unsatisfactory trend results because of the anomalies we are aiming to detect?

When using STL for the trend, between the robust flag and a fairly large t.window, I would expect outliers to have a fairly small impact on the trend. t.window's default size is ~ num_obs_per_period; increasing this to something like 2 or 3 x num_obs_per_period might be better. This is based on the assumption that the trend component varies quite a bit more slowly than the seasonality in the signal. If the trend component varies at a rate similar to the seasonality, only quite extreme abnormalities will be identifiable with pretty much any approach, unless trend signals with multiple timescales are removed.
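A minimal sketch of the wider-window idea, on synthetic data with a short burst of injected anomalies (the series, the burst, and the choice of 3 x period + 1 are assumptions). It fits `stl` with the default `t.window` and with a wider one, then reports how far each trend estimate strays from the known linear trend near the burst. It does not settle which setting is more robust on real data:

```r
# Synthetic series: seasonality + linear trend + noise + a short anomaly burst
set.seed(3)
period <- 24
n <- period * 40
idx <- seq_len(n)
true_trend <- 0.03 * idx
x <- 5 * sin(2 * pi * idx / period) + true_trend + rnorm(n)
burst <- 300:302
x[burst] <- x[burst] + 25

series <- ts(x, frequency = period)

# stl() documents that t.window should be odd
wide_window <- 3 * period + 1

fit_default <- stl(series, s.window = "periodic", robust = TRUE)
fit_wide    <- stl(series, s.window = "periodic", t.window = wide_window,
                   robust = TRUE)

trend_default  <- as.numeric(fit_default$time.series[, "trend"])
trend_wide     <- as.numeric(fit_wide$time.series[, "trend"])
remainder_wide <- as.numeric(fit_wide$time.series[, "remainder"])

# Largest deviation of each trend estimate from the true trend around the burst
near_burst <- 290:312
err_default <- max(abs(trend_default - true_trend)[near_burst])
err_wide    <- max(abs(trend_wide - true_trend)[near_burst])
print(c(default = err_default, wide = err_wide))
```

If the trend absorbs little of the burst, the injected anomalies remain clearly visible in the remainder, which is the series that would then be passed to ESD.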

The better the trend model, the more anomalies will be identified by ESD. Clearly there is a confidence question about the underlying trend for any model (constant, piecewise constant, whatever LOESS/STL finds), but I have a stronger prior expectation that things vary smoothly than that they are piecewise constant.