twitter / AnomalyDetection

Anomaly Detection with R
GNU General Public License v3.0

High frequency data sets anomaly detection #22

Open FlowQ opened 9 years ago

FlowQ commented 9 years ago

I am trying to perform anomaly detection on a very high-frequency data set (more than 5-10 rows per second) whose timestamps are not consecutive (sometimes there is no row for a given second).

Example: 09:23:59 2014-12-19 09:23:59 2014-12-19 09:24:00 2014-12-19 09:24:00 2014-12-19 09:24:02 2014-12-19 09:24:02 2014-12-19 09:24:02

I understand that I should use AnomalyDetectionTs to perform the detection on this type of set.

My set has 50K rows, but the function cannot compute the detection and crashes. Maybe this is also because the time series is not evenly spaced (the gap between rows is sometimes 1 second, sometimes 0, or even 2 seconds)?

What are your recommendations for working with this type of dataset?

Thanks,

Flow

pepijn commented 9 years ago

This is how I did it for my dataset with hourly frequency. It also sets the missing rows' count to 0 instead of NA.

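# dates.min.text / dates.max.text: first and last timestamp of the data, as text
dates.min <- as.POSIXct(dates.min.text)
dates.max <- as.POSIXct(dates.max.text)

# Complete hourly sequence covering the whole range
dates.seq.all <- seq(dates.min, dates.max, by='hour')
dates.all <- data.frame(date=dates.seq.all)

# Outer join with the observed data (data.db), then zero-fill the hours that had no row
data <- merge(dates.all, data.db, all=TRUE)
data$count[is.na(data$count)] <- 0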
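
For per-second data with repeated timestamps, like the set in the original question, the same idea might be adapted as in the sketch below. This is only an illustration in base R: the `raw` data frame, its `timestamp` and `value` columns, and the choice of `sum` to collapse rows that share a second are all assumptions, not something stated in this thread.

```r
# Hypothetical raw data: several rows per second, and no row at 09:24:01.
raw <- data.frame(
  timestamp = as.POSIXct(c("2014-12-19 09:23:59", "2014-12-19 09:23:59",
                           "2014-12-19 09:24:00", "2014-12-19 09:24:00",
                           "2014-12-19 09:24:02", "2014-12-19 09:24:02",
                           "2014-12-19 09:24:02"), tz = "UTC"),
  value = c(1, 2, 3, 4, 5, 6, 7)
)

# Collapse duplicate timestamps to one row per second (assumption: sum the values).
per.sec <- aggregate(value ~ timestamp, data = raw, FUN = sum)

# Complete one-second grid covering the observed range.
grid <- data.frame(
  timestamp = seq(min(per.sec$timestamp), max(per.sec$timestamp), by = "sec")
)

# Left join onto the grid, then zero-fill the seconds that had no row.
regular <- merge(grid, per.sec, all.x = TRUE)
regular$value[is.na(regular$value)] <- 0
```

Whether 0 is the right fill value depends on what the column measures: it makes sense for event counts, less so for a sensor reading.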

jhochenbaum commented 9 years ago

We're looking into more gracefully handling datasets with missing values. Patch soon...

jhochenbaum commented 9 years ago

Owen and I looked into this tonight, and it's a tricky one. STL decomposition can't really handle datasets with NAs in them. Here is what we propose: we're handling the cases where there are leading and/or trailing NAs, but we will throw an exception when we detect non-leading NAs.

In the latter case, we recommend you use interpolation to replace the NAs. The zoo package provides such a function (linear interpolation) called na.approx.
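
A minimal sketch of that interpolation step, assuming the zoo package is installed. The `counts` vector and the trimming logic are made up for illustration and are not the package's own implementation.

```r
library(zoo)

# Hypothetical series with a leading NA, two interior NAs and a trailing NA.
counts <- c(NA, 10, NA, NA, 16, 18, NA)

# Drop leading and trailing NAs: keep the span between the first and the
# last observed value.
observed <- which(!is.na(counts))
trimmed <- counts[min(observed):max(observed)]

# Any NA that survives the trim sits between two observed values.
has.interior.na <- anyNA(trimmed)

# Replace those NAs by linear interpolation between their neighbours.
filled <- na.approx(trimmed)
```

The `filled` vector has no NAs left and is the kind of series that could then be handed to the detection function. Note that linear interpolation smooths over the gap, so a long run of missing values becomes a straight line.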

Let us know your thoughts, thanks.