twitter / AnomalyDetection

Anomaly Detection with R
GNU General Public License v3.0

Error in data.frame #29

Closed VladimirWrites closed 9 years ago

VladimirWrites commented 9 years ago

I am getting the following error message:

    Error in data.frame(timestamp = all_anoms[[1]], anoms = all_anoms[[2]], :
      arguments imply differing number of rows: 1, 0

The data looks like this:

    1 2014-12-28 00:00:00 46.25243
    2 2014-12-28 01:00:00 43.16433
    3 2014-12-28 02:00:00 40.06927
    4 2014-12-28 03:00:00 39.27673
    5 2014-12-28 04:00:00 40.28478
    6 2014-12-28 05:00:00 47.17522
    7 2014-12-28 06:00:00 56.34756
    8 2014-12-28 07:00:00 66.45515

and the method call is:

    AnomalyDetectionTs(data, max_anoms = 0.05, threshold = "None",
                       direction = "both", plot = FALSE,
                       only_last = "day", e_value = TRUE)

VladimirWrites commented 9 years ago

Hello again.

After further analysis I've discovered that this problem occurs only when exactly one anomaly is detected and it falls at midnight, and only when the e_value parameter is set to TRUE. By debugging the AnomalyDetectionTs function I've found that the error occurs on line 241:

    anoms <- data.frame(timestamp = all_anoms[[1]], anoms = all_anoms[[2]], 
        expected_value = subset(seasonal_plus_trend[[2]], 
            as.POSIXlt(seasonal_plus_trend[[1]], tz = "UTC") %in% 
              all_anoms[[1]]))

The values are:

    all_anoms
            timestamp count
    181250 2015-03-30 120.6

    tail(seasonal_plus_trend)
                   timestamp count
    2153 2015-03-29 21:00:00    41
    2154 2015-03-29 22:00:00    40
    2155 2015-03-29 23:00:00    39
    2156 2015-03-30 00:00:00    39
    2157 2015-03-30 01:00:00    36
    2158 2015-03-30 02:00:00    35

As you can see, the value from all_anoms exists in seasonal_plus_trend (it's the value 2015-03-30 00:00:00), but the code

    subset(seasonal_plus_trend[[2]],
        as.POSIXlt(seasonal_plus_trend[[1]], tz = "UTC") %in% all_anoms[[1]])

doesn't return a value, because

    as.POSIXlt(seasonal_plus_trend[[1]], tz = "UTC") %in% all_anoms[[1]]

returns a vector of all FALSE values, which shouldn't be the case. Considering all this, the problem exists because 2015-03-30 00:00:00 (from seasonal_plus_trend) != 2015-03-30 (from all_anoms), although they are the same timestamp. This should be corrected so that the value in all_anoms is always in the same format as the value in seasonal_plus_trend (in my case, "%Y-%m-%d %H:%M:%S").
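The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not the package's internal code: it assumes the comparison ends up working on the default string form of each timestamp, and the variable names (trend_ts, anom_ts, fmt) are made up for illustration. The counts 39, 39, 36 are taken from the tail of seasonal_plus_trend shown earlier.

```r
# Timestamps mirroring the tail of seasonal_plus_trend in the report.
fmt <- "%Y-%m-%d %H:%M:%S"
trend_ts <- as.POSIXct(
  c("2015-03-29 23:00:00", "2015-03-30 00:00:00", "2015-03-30 01:00:00"),
  format = fmt, tz = "UTC"
)
# A single anomaly that falls exactly on midnight.
anom_ts <- trend_ts[2]

# Default formatting drops the time part when every element is midnight,
# so the lone anomaly and the trend vector get different string forms.
anom_default  <- format(anom_ts)
trend_default <- format(trend_ts)
default_match <- trend_default %in% anom_default

# With no match, the expected-value subset is empty and data.frame() fails.
expected <- c(39, 39, 36)[default_match]
err <- tryCatch(
  data.frame(timestamp = anom_default, anoms = 120.6,
             expected_value = expected),
  error = function(e) conditionMessage(e)
)

# An explicit format string keeps both sides in the same representation.
explicit_match <- format(trend_ts, fmt) %in% format(anom_ts, fmt)
```

Formatting both sides with the same explicit format string is one way to make the comparison independent of whether a value happens to fall on midnight.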

I found your package very useful and I hope you will fix this error soon.

Regards

owenvallis commented 9 years ago

Thanks for the heads up, and glad the package has been useful for you. We'll take a look at this and submit a patch.

Regards

jhochenbaum commented 9 years ago

@vlad1m1r990 Do you have / can you share some data with us so we can make sure we handle your scenario appropriately? Thanks!

VladimirWrites commented 9 years ago

Hello. Thank you for your quick response.

The data set I tested with is at this link: https://drive.google.com/file/d/0BxWpQRPFhtqGS21UenhCa180OEk/view.

And the function call is below:

    x <- read.csv("data.csv")
    x$date <- as.POSIXct(strptime(x$date, "%Y-%m-%d %H:%M", tz = "UTC"))
    anomalyDetectionResult <- AnomalyDetectionTs(x, max_anoms = 0.2,
        threshold = "None", direction = "both", plot = FALSE,
        only_last = "day", e_value = TRUE)

Thanks again!

jhochenbaum commented 9 years ago

Thanks, will look into it!

VladimirWrites commented 9 years ago

Thanks! :)

VladimirWrites commented 9 years ago

Hi again. I don't know if this is intended behaviour, but after this change the timestamp column of the resulting anoms data frame is now a Factor and not POSIXct/POSIXlt.

jhochenbaum commented 9 years ago

Looks like our tests weren't fully covering the types of the returned values, so even though everything appeared valid, the type was not being checked and that slipped through. Thanks!

There are a couple of things at play here. The issue you exposed stems from the fact that the POSIX classes in R appear to strip midnight from a datetime (unless it sits alongside datetimes that are not midnight), which makes the comparison in your dataset fail. The only way around this that I could find was to use format() manually to rebuild the value with hours, minutes, and seconds, but as you just noticed, that changes the type to Factor. The quick fix here is to convert back to POSIXlt.

Right now, everything gets converted to POSIXlt internally, which is a larger issue we'd like to fix. Ideally, we will preserve whatever format and tz come in (there's another ticket for this). I'm going to submit a quick fix that converts the column back to POSIXlt, which will at least make it a valid timestamp and bring it back to where things were, but a more substantial patch will be coming soon to address the larger issue of preserving the original format and timezone.
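The quick fix described here might look roughly like the sketch below. It is an illustration under assumptions, not the actual patch: the variable names are hypothetical, stringsAsFactors = TRUE is passed explicitly to mimic the data.frame() default of that time, and the column is converted to POSIXct rather than POSIXlt because POSIXct is the usual storage class for data frame columns.

```r
fmt <- "%Y-%m-%d %H:%M:%S"
anom_ts <- as.POSIXct("2015-03-30 00:00:00", format = fmt, tz = "UTC")

# format() with an explicit format keeps the midnight time, but yields text.
ts_chr <- format(anom_ts, fmt)

# Building the data frame turns that text into a factor
# (forced here so the result does not depend on the R version).
anoms <- data.frame(timestamp = ts_chr, anoms = 120.6,
                    stringsAsFactors = TRUE)
was_factor <- is.factor(anoms$timestamp)

# Convert back to a date-time class; go through as.character() so the
# factor labels are parsed rather than the underlying integer codes.
anoms$timestamp <- as.POSIXct(as.character(anoms$timestamp),
                              format = fmt, tz = "UTC")
```

Going through as.character() first matters: a factor stores integer codes, so the labels need to be recovered before parsing.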

Thanks!

erikriverson commented 9 years ago

@jhochenbaum Hey just a couple quick comments here as I ran into this too. As you note, the format argument for ?format.POSIXct says:

format: A character string. The default for the 'format' methods is '"%Y-%m-%d %H:%M:%S"' if any element has a time component which is not midnight, and '"%Y-%m-%d"' otherwise.

When you call the format function, you create a character vector. And then when a new data.frame is generated using the character vector, the class of that variable becomes a factor as you saw. You can keep the class of the variable as a character if you supply the stringsAsFactors = FALSE option to the data.frame function. E.g., compare

> str(data.frame(a = letters))
  'data.frame': 26 obs. of  1 variable:
    $ a: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ... 

> str(data.frame(a = letters, stringsAsFactors = FALSE))
  'data.frame': 26 obs. of  1 variable:
    $ a: chr  "a" "b" "c" "d" ...
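Applied to the timestamp case, the same option could be used as in the sketch below (variable names are hypothetical, and this is not the package's code). Passing stringsAsFactors explicitly makes the result independent of the R version: since R 4.0.0 the default is FALSE, while older versions defaulted to TRUE.

```r
fmt <- "%Y-%m-%d %H:%M:%S"
ts_chr <- format(as.POSIXct("2015-03-30 00:00:00", format = fmt, tz = "UTC"),
                 fmt)

# Same input, two explicit settings of stringsAsFactors.
as_fct <- data.frame(timestamp = ts_chr, stringsAsFactors = TRUE)
as_chr <- data.frame(timestamp = ts_chr, stringsAsFactors = FALSE)

class(as_fct$timestamp)
class(as_chr$timestamp)

# A character column can be parsed back to a date-time directly.
parsed <- as.POSIXct(as_chr$timestamp, format = fmt, tz = "UTC")
```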