Open zachmayer opened 9 years ago
Thanks, Owen and I will look into it Zach!
Thanks!
This looks like a useful package, thanks!
Perhaps the 'timestamp' column should be required to be of class POSIXct? I don't believe base::merge works when the 'by' column is of class POSIXlt. For example:
x <- Sys.time()
test1 <- data.frame(timestamp = as.POSIXct(x:(x + 9), origin = "1970-01-01"),
y = rnorm(1:10))
test2 <- data.frame(timestamp = test1$timestamp, z = rnorm(1:10))
## ok when timestamp is POSIXct
merge(test1, test2, by = "timestamp")
## not ok when timestamp is POSIXlt
test1$timestamp <- test2$timestamp <- as.POSIXlt(test1$timestamp)
merge(test1, test2, by = "timestamp")
With the timezone issue, if the 'timestamp' column is POSIXct, the timezone on the object returned can be made to match that of the input data. If the timestamp column is not POSIXct, it could be converted using as.POSIXct instead of strptime in format_timestamp, and given a default time zone of UTC possibly.
@erikeldridge There's really 2 issues here:
AnomalyDetectionTs
is off by 4 hours).@erikeldridge I didn't see the last paragraph of your comment. I agree, if the timestamp is of class POSIXct, the timezone should be kept, and if the timestamp must be converted the timezone should be preserved if possible.
Thanks for the comments guys -- Owen and I will look into this later today, but on glance you're right, we should preserve the timestamp/timezone. Stay tuned...
@jhochenbaum Awesome, thank you!
@zachmayer, would you like to submit a patch and @owenvallis and I can review?
@jhochenbaum I haven't had time to take a crack on this yet, but if I do I'll submit a PR.
Thanks @zachmayer
@zachmayer After my PR https://github.com/twitter/AnomalyDetection/pull/92, the timestamp
column is now stored as POSIXct rather than POSIXlt and the timezone attribute is kept. Now your code demo produces following output
> library(AnomalyDetection)
> data <- raw_data
> str(data)
'data.frame': 14398 obs. of 2 variables:
$ timestamp: POSIXct, format: "1980-09-25 14:01:00" "1980-09-25 14:02:00" "1980-09-25 14:03:00" ...
$ count : num 182 176 184 178 165 ...
> attr(data$timestamp, "tzone")
[1] "UTC"
>
> res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = FALSE)
> attr(res$anoms$timestamp, 'tzone')
[1] "UTC"
You should check timezone attribute for res$anoms$timestamp
instead of res$anoms
. Therefore, the timezone attribute was not dropped.
However, your code snippet exposed another problem of AnomalyDetection package, which is the input timestamp with non UTC timezone. The problem is explained by following code output.
> attr(data$timestamp, "tzone") <- "America/New_York"
> attr(data$timestamp, "tzone")
[1] "America/New_York"
> res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = FALSE)
> attr(res$anoms$timestamp, 'tzone')
[1] "UTC"
> head(res$anoms)
timestamp anoms
1 1980-09-25 12:05:00 21.3510
2 1980-09-29 02:40:00 193.1036
3 1980-09-30 16:26:00 180.8990
4 1980-09-30 16:30:00 178.8220
5 1980-09-30 16:31:00 198.3260
6 1980-09-30 16:32:00 203.9010
> data[120:130, ]
timestamp count
120 1980-09-25 12:00:00 134.646
121 1980-09-25 12:01:00 157.175
122 1980-09-25 12:02:00 150.374
123 1980-09-25 12:03:00 151.579
124 1980-09-25 12:04:00 133.844
125 1980-09-25 12:05:00 21.351
126 1980-09-25 12:06:00 120.827
127 1980-09-25 12:07:00 144.293
128 1980-09-25 12:08:00 137.726
129 1980-09-25 12:09:00 131.608
130 1980-09-25 12:10:00 135.320
As you see, the 1st row of res$anoms
is corresponding to the 125th row of data
. The timestamp
columns are of the same value and the only difference is the timezone attribute.
I will try to fix the problem.
I have fixed it by PR https://github.com/twitter/AnomalyDetection/pull/92
Now the AnomalyDetectionTs keeps the original timezone attr as the input timestamp, see following output.
> attr(data$timestamp, "tzone") <- "America/New_York"
> attr(data$timestamp, "tzone")
[1] "America/New_York"
> res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = TRUE)
> attr(res$anoms$timestamp, 'tzone')
[1] "America/New_York"
> head(res$anoms)
timestamp anoms
1 1980-09-25 12:05:00 21.3510
2 1980-09-29 02:40:00 193.1036
3 1980-09-30 16:26:00 180.8990
4 1980-09-30 16:30:00 178.8220
5 1980-09-30 16:31:00 198.3260
6 1980-09-30 16:32:00 203.9010
> data[120:130, ]
timestamp count
120 1980-09-25 12:00:00 134.646
121 1980-09-25 12:01:00 157.175
122 1980-09-25 12:02:00 150.374
123 1980-09-25 12:03:00 151.579
124 1980-09-25 12:04:00 133.844
125 1980-09-25 12:05:00 21.351
126 1980-09-25 12:06:00 120.827
127 1980-09-25 12:07:00 144.293
128 1980-09-25 12:08:00 137.726
129 1980-09-25 12:09:00 131.608
130 1980-09-25 12:10:00 135.320
Hi team,
Sorry for this up, I mean really.
But with this script :
library(AnomalyDetection) data(raw_data) data <- raw_data data$timestamp <- as.POSIXct(data$timestamp) attr(data$timestamp, "tzone") <- "America/New_York" attr(data$timestamp, "tzone") res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = TRUE) attr(res$anoms$timestamp, 'tzone') res$plot
Here what I have :
library(AnomalyDetection)
data(raw_data) data <- raw_data data$timestamp <- as.POSIXct(data$timestamp) attr(data$timestamp, "tzone") <- "America/New_York" attr(data$timestamp, "tzone") [1] "America/New_York" res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = TRUE) attr(res$anoms$timestamp, 'tzone') [1] "UTC" res$plot Error: Column
x
is a date/time and must be stored as POSIXct, not POSIXlt
I cannot understand why, I've checked all what you've said, but no result... May I have some help ?
@Maryoda2 The Error: Column x is a date/time and must be stored as POSIXct, not POSIXlt
has been fixed by my PR https://github.com/twitter/AnomalyDetection/pull/92, which has not been merged into the master branch (It seems that the twitter/AnomalyDetection
repository has not been maintained since Sep, 2015). You could install AnomalyDetection package that I modified.
devtools::install_github("caijun/AnomalyDetection")
Then rerunning your script will produce the expected results without errors (I tested your script).
Oh thanks a lot !
Well... in fact sometimes it writes :
Error in aggregate.data.frame(x[2], format(x[1], "%Y-%m-%d %H:%M:00"), : 'by' must be a list
@Maryoda2 could you give an example reproducing the error?
library(AnomalyDetection) data <-dems data$timestamp <- as.POSIXct(data$timestamp) attr(data$timestamp, "tzone") <- "America/New_York" attr(data$timestamp, "tzone") res <- AnomalyDetectionTs(data, max_anoms = 0.02, direction = 'both', plot = TRUE) attr(res$anoms$timestamp, 'tzone') res$plot
With another dataset, I've tried to translate your script. I created a dataset called dems and attempted to format it exactly like raw_data, but it doesn't work
Please run str(data)
to see the structure of data
and post the result. Otherwise without your data, I cannot reproduce the error to provide help.
str(raw_data) 'data.frame': 14398 obs. of 2 variables: $ timestamp: POSIXct, format: "1980-09-25 10:01:00" "1980-09-25 10:02:00" ... $ count : num 182 176 184 178 165 ... str(dems) Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1127 obs. of 2 variables: $ timestamp: POSIXct, format: "2018-07-04 09:51:41" "2018-07-04 09:51:51" ... $ count : num 2.6 2.6 2.6 2.6 2.5 2.4 2.4 2.4 2.5 2.7 ...
This is because your timestamp is not in the format of "%Y-%m-%d %H:%M:00", which means AnomalyDetectionTs()
detect anomalies every minute, while your timestamp appears to be every 10 seconds. If you really need to detect anomalies at a time resolution of less than a minute, then the code of twitter/AnomalyDetection
has to be modified to support such a feature.
Oh I see, How can I edit the code of twitter/AnomalyDetection, in fact the caijun/Twitter (yours) ?
@Maryoda2 Could you share a sample of your data, which I can use for reproducing the error and testing?
I think the problem you encountered has been reported in issue https://github.com/twitter/AnomalyDetection/issues/77, which has been resolved by PRs https://github.com/twitter/AnomalyDetection/pull/98 and https://github.com/twitter/AnomalyDetection/pull/79
Although these PRs have not been merged into the twitter/AnomalyDetection master branch, but has been merged into isalgueiro/AnomalyDetection. You can try this fork of AnomalyDetection by
devtools::install_github("isalgueiro/AnomalyDetection")
Well thanks for your answer With isalgueiro it doesn't work, same error. Here is a sample
@Maryoda2 the data you provided is not enough to reproduce the error. Please provide data with at least 20000 rows.
20.000 is no more a sample, but we can't detect anomalies on dataset less than 20.000 ?
@Maryoda2 Using the data you provided, I encountered the error "Anom detection needs at least 2 periods worth of data". Hence, I need more data to make sure this error was not because of limited data.
In fact for the moment this is all what I have... So game over then ?
@Maryoda2 this error has been discussed in issues https://github.com/twitter/AnomalyDetection/issues/15 and https://github.com/twitter/AnomalyDetection/issues/42, perhaps they can provide some useful information for you to fix this error by yourself, e.g. fork this repository and modify the source code.
AnomalyDetectionTs should keep the timestamp column of the output dataset as-is, rather than converting to POSIXct and dropping the timezone attribute:
Dropping the timezone is problematic if you wish to merge the anomalies back to the main dataset: