twitter / AnomalyDetection

Anomaly Detection with R
GNU General Public License v3.0

Anom detection needs at least 2 periods worth of data #15

Open odp opened 9 years ago

odp commented 9 years ago

str(bar)
'data.frame': 506 obs. of 2 variables:
 $ timestamp: POSIXct, format: "2014-08-25 00:00:00" "2014-08-25 00:10:00" ...
 $ count    : num 40465895 54157589 34727655 38576160 36686470 ...

res = AnomalyDetectionTs(bar, direction='both', max_anoms=0.02, plot=TRUE)
Error in detect_anoms(all_data[[i]], k = max_anoms, alpha = alpha, num_obs_per_period = period, :
  Anom detection needs at least 2 periods worth of data

What's the definition of period here? The data contains a time series for about 4 days with granularity of 10 minutes.

Posting the data frame "bar" here https://www.dropbox.com/s/1j263k6srq18qpp/bar.Rda?dl=0

odp commented 9 years ago

After some debugging: when get_gran() decides the granularity is "min", we set period = 24*60 = 1440, that is, we assume one observation per minute. Next, detect_anoms() expects num_obs to be at least twice the period:

if(num_obs < num_obs_per_period * 2) {
    stop("Anom detection needs at least 2 periods worth of data")
}

So the period is effectively a day here, and we need at least 2*1440 = 2880 observations. This implicitly assumes one-minute granularity and at least two days' worth of data.
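To make the arithmetic concrete, here is a minimal base R sketch of that check applied to the numbers from this issue (506 observations at 10-minute spacing). The helper name `has_two_periods` is hypothetical and only mirrors the comparison quoted above; it is not a function from the package.

```r
# Values taken from this thread: 506 observations at 10-minute spacing.
num_obs <- 506
minutes_per_obs <- 10

# Period reported above for "min" granularity: one observation per minute.
assumed_period <- 24 * 60                    # 1440 observations per day

# Observations per day actually present at 10-minute spacing.
actual_period <- 24 * 60 / minutes_per_obs   # 144 observations per day

# Hypothetical helper mirroring the comparison in the quoted check.
has_two_periods <- function(num_obs, num_obs_per_period) {
  num_obs >= num_obs_per_period * 2
}

has_two_periods(num_obs, assumed_period)  # FALSE: 506 < 2880
has_two_periods(num_obs, actual_period)   # TRUE:  506 >= 288
```

With a period that matches the real spacing, about 3.5 days of data would be enough to pass the check.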

Is there anything that can be done here when the granularity is multiple minutes?

owenvallis commented 9 years ago

You're totally right. The seasonality we were looking at was either daily (if the data was minutely or hourly) or weekly (if the data was daily). We added AnomalyDetectionVec() to support time series data of any granularity or period length. You can pass in the data column and manually specify the period length. Additional info on the Vec function can be found using help(AnomalyDetectionVec).
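For 10-minute data like the series in this issue, a call along these lines might work. This is only a sketch: the series below is synthetic stand-in data (not the poster's `bar`), and period = 144 assumes a daily cycle at 10-minute spacing. The package call is guarded so the snippet still runs if AnomalyDetection is not installed.

```r
# Synthetic stand-in for the `count` column: 506 points at 10-minute spacing
# with a daily cycle plus noise. Illustrative data only.
set.seed(1)
obs_per_day <- 24 * 60 / 10   # 144 observations per day
n <- 506
count <- 4e7 + 1e7 * sin(2 * pi * seq_len(n) / obs_per_day) + rnorm(n, sd = 1e6)

# Only call the package if it is available.
if (requireNamespace("AnomalyDetection", quietly = TRUE)) {
  res <- AnomalyDetection::AnomalyDetectionVec(
    count,
    max_anoms = 0.02,
    direction = "both",
    period = obs_per_day,   # observations per seasonal cycle, set manually
    plot = FALSE
  )
  print(res$anoms)
}
```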

However, it would be nice for AnomalyDetectionTs() to support additional data granularities, or non-consecutive timestamps. Would you like to submit a patch, and @jhochenbaum and I can review?

odp commented 9 years ago

thanks. I'll try to come up with something.

elbamos commented 9 years ago

I get this even with daily data, and I've confirmed using the internal AnomalyDetection::: functions that it is correctly recognizing the period. Minimal example:

quantmod::getSymbols("^GSPC")
minimal <- data.frame(timestamp = index(GSPC), count = GSPC$GSPC.Adjusted)
AnomalyDetectionTs(minimal, longterm = TRUE)

owenvallis commented 9 years ago

Hi Elbamos,

I was able to reproduce your error, and I'll look into posting a patch soon. In the interim, you can run the data using the following:

AnomalyDetectionVec(minimal[[2]], period=7, longterm_period=30, plot=T)

That will give a weekly periodicity, and assumes a longterm stable state of 30 days. Both parameters can be changed, but the longterm_period must be at least (period*2)+1.
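As a quick sanity check on that constraint, here is a small base R sketch. `valid_longterm_period` is a hypothetical helper written for this comment, not something the package exports.

```r
# Constraint stated above: longterm_period must be at least (period * 2) + 1.
# `valid_longterm_period` is a hypothetical helper, not a package function.
valid_longterm_period <- function(period, longterm_period) {
  longterm_period >= period * 2 + 1
}

valid_longterm_period(period = 7, longterm_period = 30)  # TRUE:  30 >= 15
valid_longterm_period(period = 7, longterm_period = 14)  # FALSE: 14 < 15
```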

The other issue was that the timestamps are currently doubles, while the Ts function is expecting a POSIX type. We are checking for that, but I think we are going to rework this to return the timestamps in the same format as they were passed in.

Hope that helps. Cheers,

elbamos commented 9 years ago

I'm just wondering if this ever got fixed...

rtjohn commented 8 years ago

From help(AnomalyDetectionVec):

period Defines the number of observations in a single period, and used during seasonal decomposition.

But what is the definition of a period? In the forecast package one uses a "frequency" argument, which is specified in terms of a year: quarterly data would be frequency = 4, monthly data would be frequency = 12, and daily data would be frequency = 365. What is the definition of "period" in this package? I have monthly data (1 row per month). What period do I use?

owenvallis commented 8 years ago

Hi rtjohn,

We used period here to denote the number of observations in a single cycle of the dominant seasonal component. This way we can define the number of observations per cycle without having to relate the number of cycles to some window, e.g., annual, quarterly, etc.
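Under that definition, the period is just the cycle length divided by the sampling interval, both in the same unit. A minimal base R sketch, assuming the dominant cycle is known in advance; `obs_per_cycle` is a hypothetical helper, not part of the package:

```r
# `obs_per_cycle` is a hypothetical helper, not a package function.
# Both arguments must be expressed in the same unit.
obs_per_cycle <- function(cycle_length, sampling_interval) {
  cycle_length / sampling_interval
}

obs_per_cycle(12, 1)        # monthly data, yearly cycle (months):    12
obs_per_cycle(12, 3)        # quarterly data, yearly cycle (months):   4
obs_per_cycle(7, 1)         # daily data, weekly cycle (days):         7
obs_per_cycle(24 * 60, 10)  # 10-minute data, daily cycle (minutes): 144
```

So for monthly data whose dominant pattern repeats once a year, this reading would suggest period = 12.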

Best,

rtjohn commented 8 years ago

I think there is some terminology confusion here. Time series data generally can have trend, seasonal, and/or cyclic components, right? So you want users to "define the number of observations per cycle" (cyclic component)? But the definition of a cyclic component is that it is not of a fixed period...

Also, isn't a seasonal component by its nature defined by a fixed known window: weekly, monthly, quarterly, etc.? I can tell you're trying to help me out, but your answer to my question about the definition of "period" makes me need clarity on your definitions of "seasonal" and "cycle". See what I mean?

So again for monthly data with let's say a strong true "season"-al pattern (changing drastically from winter, to spring to fall to summer) the period argument should be 3 right? I'd have 3 periods in a single "cycle" as you'd call it?

elbamos commented 8 years ago

@rtjohn while I totally relate to the point you're making, and I've found the issue confusing too, I'm pretty sure the package uses the same conventions for cycle and period definition as base R does. That is definitely not friendly, but the package should conform to the convention of the platform.

owenvallis commented 8 years ago

@rtjohn I see what you're saying. This Seasonal-Trend Decomposition paper was a big part of developing the package, and we based our naming conventions around their notion of "Seasonal, Trend, Residual" terminology. So in that case, Seasonal components would be the repeating cycles in the time series, the Trend would account for the variations from winter to summer, and the Residual should be the unimodal noise that we can use to detect the anoms. Also, Jordan and I have an audio background, so we tend to treat cycle as synonymous with period.
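For anyone who wants to see the three components side by side, base R ships an STL implementation as stl(). The sketch below runs it on the built-in monthly `nottem` dataset; it only illustrates the Seasonal/Trend/Residual split and is not meant to show exactly how the package calls it internally.

```r
# nottem: monthly air temperatures at Nottingham, shipped with base R.
frequency(nottem)   # 12 observations per yearly cycle

# Seasonal-trend decomposition with a fixed (periodic) seasonal pattern.
fit <- stl(nottem, s.window = "periodic")
components <- fit$time.series

colnames(components)   # "seasonal" "trend" "remainder"
head(components, 3)

# The three components add back up to the original series.
recombined <- as.numeric(rowSums(components))
all.equal(recombined, as.numeric(nottem))
```

Here frequency = 12 plays the role of "period": the number of observations in one seasonal cycle.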

Let us know if we could improve the doc strings though.

aaishaosman commented 7 years ago

Hi all, I am quite new to this package and would like to use it for some analysis I am doing. I have data that is not regular, i.e. trading data. Would I be able to use AnomalyDetection to identify, say, irregular prices charged? If so, what would I set the "period" to, given that on some days there might be a trade every second or every hour, and on some days none? I have data for roughly a year.

Any help will be greatly appreciated! Thanks!

asavla commented 7 years ago

Still getting:

Error in detect_anoms(all_data[[i]], k = max_anoms, alpha = alpha, num_obs_per_period = period, :
  Anom detection needs at least 2 periods worth of data

Has this been resolved?