twitter / AnomalyDetection

Anomaly Detection with R

Issues using daily data with the "long_term" option #20

Open cozos opened 9 years ago

cozos commented 9 years ago

I'm not sure this package was meant to be used on daily data, since Twitter seems to use it for very granular, minute-level data. In any case, here are the issues I've encountered:

Data Set: Daily timestamp/count pairs for the past two years (so around 730 rows)

With "long_term=true" and daily data (therefore "gran=day", "period = 7"), AnomalyDetectionTs splits the dataset into two-week windows of 14 rows, one row per day (ts_anom_detection.R, lines 168-177).
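Roughly, the split behaves like the sketch below (an illustration of the behavior described in this report, not the package's actual code; period = 7 and the ~730-row length come from the data set above):

```r
# Illustrative sketch only; not ts_anom_detection.R itself. With daily data
# the long_term option works on windows of period * 2 = 14 rows each.
period <- 7
window_size <- period * 2                    # 14 daily rows per window
n <- 730                                     # ~2 years of daily data
starts <- seq(1, n, by = window_size)
windows <- lapply(starts, function(s) s:min(s + window_size - 1, n))
sapply(windows, length)                      # mostly 14; the final window can be shorter
```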

This causes two issues:

  1. detect_anoms is passed a dataset of 14 rows with num_obs_per_period of 7, which causes the STL function to throw the error "stl: series is not periodic or has less than two periods" (reproduced in the sketch after this list)

    stl(ts(data[[2]], frequency=num_obs_per_period), s.window="periodic", robust=TRUE) (detect_anoms.R, line 33)

    I think this happens for one of two reasons. First, the STL function needs the dataset to have at least 2*frequency + 1 observations, which is a given for minutely/hourly data over two weeks but not for daily data (only 14 rows in two weeks). Second, it can happen when the last two-week subset is shorter than two weeks: for example, 53 weeks of data with long_term enabled produces 26 two-week intervals and one 1-week interval, and that final 1-week interval throws "series is not periodic or has less than two periods" when passed into STL.

  2. The number of anomalies allowed on a two-week window of daily data will always truncate to 0 (0.02 * 14 = 0.28, which rounds down to 0) unless you use a very large max_anoms. Two-week windows are probably too small for daily data; see the reproduction after this list.
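Both points can be reproduced in a few lines of R (the series here is synthetic; only num_obs_per_period = 7, the 14-row window, and the 0.02 max_anoms value come from this report):

```r
# Synthetic 14-row daily window, standing in for one long_term chunk.
set.seed(1)
x <- rnorm(14)
num_obs_per_period <- 7

# stl() requires n > 2 * frequency, i.e. at least 15 points when frequency = 7,
# so a 14-row window triggers the error quoted above.
try(stl(ts(x, frequency = num_obs_per_period), s.window = "periodic", robust = TRUE))

# The per-window anomaly cap (max_anoms times the number of observations,
# rounded down) works out to zero for 14 daily rows at max_anoms = 0.02.
trunc(0.02 * length(x))   # 0 anomalies allowed
```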

Apologies if the expectation was to fix the issues and open a pull request myself :). I'm not sure whether S-H-ESD is meant to be used on daily data at all.

-Arwin from Adroll

owenvallis commented 9 years ago

Hi Arwin,

Thanks for trying out the package. As you mentioned, the daily granularity seems to be causing an issue with the long_term option. Line 180 in ts_anom_detection.R should jump to the end and grab two weeks of data from that point backwards; however, as you note, with daily data it only grabs 14 days instead of 14 days + 1 (see the sketch below). You also bring up a good point about the granularity causing an issue with max_anoms.
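For illustration, the off-by-one can be seen with simple index arithmetic (this is not the package's actual line 180, just the counting):

```r
# Counting 14 indices back from the end of ~2 years of daily rows gives a
# 14-row window, one observation short of the 2 * 7 + 1 that stl() needs.
n <- 730
period <- 7
length((n - period * 2 + 1):n)   # 14 rows: stl() rejects this window
length((n - period * 2):n)       # 15 rows: satisfies stl()'s n > 2 * frequency check
```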

We'll certainly look into the issue, but if you'd like to submit a patch, @jhochenbaum and I can review it.