twitter / AnomalyDetection

Anomaly Detection with R
GNU General Public License v3.0
3.56k stars 779 forks source link

AnomalyDetectionTs drops timezone from POSIXct objects and converts POSIXlt to POSIXct #2

Open zachmayer opened 9 years ago

zachmayer commented 9 years ago

AnomalyDetectionTs should keep the timestamp column of the output dataset as-is, rather than converting to POSIXct and dropping the timezone attribute:

library(AnomalyDetection)
data(data_raw)
data <- raw_data
data$timestamp <- as.POSIXct(data$timestamp)
attr(data$timestamp, "tzone") 
attr(data$timestamp, "tzone") <- "America/New_York"

res = AnomalyDetectionTs(data, max_anoms=0.002, direction='both', plot=FALSE)
attr(res$anoms, 'tzone')

Dropping the timezone is problematic if you wish to merge the anomalies back to the main dataset:

> merge(data, res$anoms, by='timestamp')
             timestamp   count    anoms
1  1980-09-28 22:40:00 114.308 193.1036
2  1980-09-30 12:26:00 130.222 180.8990
3  1980-09-30 12:30:00 126.721 178.8220
4  1980-09-30 12:31:00 152.956 198.3260
5  1980-09-30 12:32:00 136.004 203.9010
6  1980-09-30 12:33:00 134.589 200.3090
7  1980-09-30 12:34:00 122.490 178.4910
8  1980-09-30 12:36:00 126.806 183.0180
9  1980-09-30 12:38:00 117.334 186.8230
10 1980-09-30 12:39:00 121.061 183.6600
11 1980-09-30 12:40:00 116.924 179.2760
12 1980-09-30 12:41:00 129.097 197.2830
13 1980-09-30 12:42:00 119.566 191.0970
14 1980-09-30 12:43:00 137.694 194.6700
15 1980-09-30 12:46:00 136.876 200.8160
16 1980-09-30 12:47:00 125.126 186.2350
17 1980-09-30 12:48:00 122.008 185.4210
18 1980-09-30 12:49:00 127.935 178.9580
19 1980-09-30 12:51:00 138.159 203.2310
20 1980-09-30 12:52:00 130.939 181.3540
21 1980-09-30 12:53:00 122.351 186.7780
22 1980-09-30 12:55:00 121.120 176.1250
23 1980-09-30 12:56:00 122.707 181.5140
24 1980-09-30 12:57:00 118.378 175.2610
25 1980-10-05 05:18:00 101.332  40.0000
26 1980-10-05 05:28:00 103.798 250.0000
27 1980-10-05 05:38:00 100.839  40.0000
jhochenbaum commented 9 years ago

Thanks, Owen and I will look into it Zach!

zachmayer commented 9 years ago

Thanks!

erikriverson commented 9 years ago

This looks like a useful package, thanks!

Perhaps the 'timestamp' column should be required to be of class POSIXct? I don't believe base::merge works when the 'by' column is of class POSIXlt. For example:

x <- Sys.time()
test1 <- data.frame(timestamp = as.POSIXct(x:(x + 9), origin = "1970-01-01"),
                    y = rnorm(1:10))
test2 <- data.frame(timestamp = test1$timestamp, z = rnorm(1:10))

## ok when timestamp is POSIXct
merge(test1, test2, by = "timestamp")

## not ok when timestamp is POSIXlt
test1$timestamp <- test2$timestamp <- as.POSIXlt(test1$timestamp)
merge(test1, test2, by = "timestamp")

With the timezone issue, if the 'timestamp' column is POSIXct, the timezone on the object returned can be made to match that of the input data. If the timestamp column is not POSIXct, it could be converted using as.POSIXct instead of strptime in format_timestamp, and given a default time zone of UTC possibly.

zachmayer commented 9 years ago

@erikeldridge There's really 2 issues here:

  1. POSIXlt is silently converted to POSIXct. Perhaps forcing the use of POSIXct is the way to go, but we should at least get a warning before conversion. Personally, I'm in favor of the output using the exact same data type as the input and letting the user worry about how to merge a POSIXlt object.
  2. Even if the user provided a POSIXct column, the timezone is silently dropped, which means the return data has the wrong timezone and merges incorrectly with the original data (see my example above: the data returned by AnomalyDetectionTs is off by 4 hours).
zachmayer commented 9 years ago

@erikeldridge I didn't see the last paragraph of your comment. I agree, if the timestamp is of class POSIXct, the timezone should be kept, and if the timestamp must be converted the timezone should be preserved if possible.

jhochenbaum commented 9 years ago

Thanks for the comments guys -- Owen and I will look into this later today, but on glance you're right, we should preserve the timestamp/timezone. Stay tuned...

zachmayer commented 9 years ago

@jhochenbaum Awesome, thank you!

jhochenbaum commented 9 years ago

@zachmayer, would you like to submit a patch and @owenvallis and I can review?

zachmayer commented 9 years ago

@jhochenbaum I haven't had time to take a crack on this yet, but if I do I'll submit a PR.

jhochenbaum commented 9 years ago

Thanks @zachmayer

caijun commented 6 years ago

@zachmayer After my PR https://github.com/twitter/AnomalyDetection/pull/92, the timestamp column is now stored as POSIXct rather than POSIXlt and the timezone attribute is kept. Now your code demo produces following output

> library(AnomalyDetection)
> data <- raw_data
> str(data)
'data.frame':   14398 obs. of  2 variables:
 $ timestamp: POSIXct, format: "1980-09-25 14:01:00" "1980-09-25 14:02:00" "1980-09-25 14:03:00" ...
 $ count    : num  182 176 184 178 165 ...
> attr(data$timestamp, "tzone")
[1] "UTC"
> 
> res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = FALSE)
> attr(res$anoms$timestamp, 'tzone')
[1] "UTC"

You should check timezone attribute for res$anoms$timestamp instead of res$anoms. Therefore, the timezone attribute was not dropped.

However, your code snippet exposed another problem of AnomalyDetection package, which is the input timestamp with non UTC timezone. The problem is explained by following code output.

> attr(data$timestamp, "tzone") <- "America/New_York"
> attr(data$timestamp, "tzone")
[1] "America/New_York"
> res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = FALSE)
> attr(res$anoms$timestamp, 'tzone')
[1] "UTC"
> head(res$anoms)
            timestamp    anoms
1 1980-09-25 12:05:00  21.3510
2 1980-09-29 02:40:00 193.1036
3 1980-09-30 16:26:00 180.8990
4 1980-09-30 16:30:00 178.8220
5 1980-09-30 16:31:00 198.3260
6 1980-09-30 16:32:00 203.9010
> data[120:130, ]
              timestamp   count
120 1980-09-25 12:00:00 134.646
121 1980-09-25 12:01:00 157.175
122 1980-09-25 12:02:00 150.374
123 1980-09-25 12:03:00 151.579
124 1980-09-25 12:04:00 133.844
125 1980-09-25 12:05:00  21.351
126 1980-09-25 12:06:00 120.827
127 1980-09-25 12:07:00 144.293
128 1980-09-25 12:08:00 137.726
129 1980-09-25 12:09:00 131.608
130 1980-09-25 12:10:00 135.320

As you see, the 1st row of res$anoms is corresponding to the 125th row of data. The timestamp columns are of the same value and the only difference is the timezone attribute.

I will try to fix the problem.

caijun commented 6 years ago

I have fixed it by PR https://github.com/twitter/AnomalyDetection/pull/92

Now the AnomalyDetectionTs keeps the original timezone attr as the input timestamp, see following output.

> attr(data$timestamp, "tzone") <- "America/New_York"
> attr(data$timestamp, "tzone")
[1] "America/New_York"
> res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = TRUE)
> attr(res$anoms$timestamp, 'tzone')
[1] "America/New_York"
> head(res$anoms)
            timestamp    anoms
1 1980-09-25 12:05:00  21.3510
2 1980-09-29 02:40:00 193.1036
3 1980-09-30 16:26:00 180.8990
4 1980-09-30 16:30:00 178.8220
5 1980-09-30 16:31:00 198.3260
6 1980-09-30 16:32:00 203.9010
> data[120:130, ]
              timestamp   count
120 1980-09-25 12:00:00 134.646
121 1980-09-25 12:01:00 157.175
122 1980-09-25 12:02:00 150.374
123 1980-09-25 12:03:00 151.579
124 1980-09-25 12:04:00 133.844
125 1980-09-25 12:05:00  21.351
126 1980-09-25 12:06:00 120.827
127 1980-09-25 12:07:00 144.293
128 1980-09-25 12:08:00 137.726
129 1980-09-25 12:09:00 131.608
130 1980-09-25 12:10:00 135.320
Maryoda2 commented 6 years ago

Hi team,

Sorry for this up, I mean really.

But with this script :

library(AnomalyDetection) data(raw_data) data <- raw_data data$timestamp <- as.POSIXct(data$timestamp) attr(data$timestamp, "tzone") <- "America/New_York" attr(data$timestamp, "tzone") res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = TRUE) attr(res$anoms$timestamp, 'tzone') res$plot

Here what I have :

library(AnomalyDetection)

data(raw_data) data <- raw_data data$timestamp <- as.POSIXct(data$timestamp) attr(data$timestamp, "tzone") <- "America/New_York" attr(data$timestamp, "tzone") [1] "America/New_York" res <- AnomalyDetectionTs(data, max_anoms = 0.002, direction = 'both', plot = TRUE) attr(res$anoms$timestamp, 'tzone') [1] "UTC" res$plot Error: Column x is a date/time and must be stored as POSIXct, not POSIXlt

I cannot understand why, I've checked all what you've said, but no result... May I have some help ?

caijun commented 6 years ago

@Maryoda2 The Error: Column x is a date/time and must be stored as POSIXct, not POSIXlt has been fixed by my PR https://github.com/twitter/AnomalyDetection/pull/92, which has not been merged into the master branch (It seems that the twitter/AnomalyDetection repository has not been maintained since Sep, 2015). You could install AnomalyDetection package that I modified.

devtools::install_github("caijun/AnomalyDetection")

Then rerunning your script will produce the expected results without errors (I tested your script).

Maryoda2 commented 6 years ago

Oh thanks a lot !

Maryoda2 commented 6 years ago

Well... in fact sometimes it writes :

Error in aggregate.data.frame(x[2], format(x[1], "%Y-%m-%d %H:%M:00"), : 'by' must be a list

caijun commented 6 years ago

@Maryoda2 could you give an example reproducing the error?

Maryoda2 commented 6 years ago

library(AnomalyDetection) data <-dems data$timestamp <- as.POSIXct(data$timestamp) attr(data$timestamp, "tzone") <- "America/New_York" attr(data$timestamp, "tzone") res <- AnomalyDetectionTs(data, max_anoms = 0.02, direction = 'both', plot = TRUE) attr(res$anoms$timestamp, 'tzone') res$plot

With another dataset, I've tried to translate your script. I created a dataset called dems and attempted to format it exactly like raw_data, but it doesn't work

caijun commented 6 years ago

Please run str(data) to see the structure of data and post the result. Otherwise without your data, I cannot reproduce the error to provide help.

Maryoda2 commented 6 years ago

str(raw_data) 'data.frame': 14398 obs. of 2 variables: $ timestamp: POSIXct, format: "1980-09-25 10:01:00" "1980-09-25 10:02:00" ... $ count : num 182 176 184 178 165 ... str(dems) Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1127 obs. of 2 variables: $ timestamp: POSIXct, format: "2018-07-04 09:51:41" "2018-07-04 09:51:51" ... $ count : num 2.6 2.6 2.6 2.6 2.5 2.4 2.4 2.4 2.5 2.7 ...

caijun commented 6 years ago

This is because your timestamp is not in the format of "%Y-%m-%d %H:%M:00", which means AnomalyDetectionTs() detect anomalies every minute, while your timestamp appears to be every 10 seconds. If you really need to detect anomalies at a time resolution of less than a minute, then the code of twitter/AnomalyDetection has to be modified to support such a feature.

Maryoda2 commented 6 years ago

Oh I see, How can I edit the code of twitter/AnomalyDetection, in fact the caijun/Twitter (yours) ?

caijun commented 6 years ago

@Maryoda2 Could you share a sample of your data, which I can use for reproducing the error and testing?

I think the problem you encountered has been reported in issue https://github.com/twitter/AnomalyDetection/issues/77, which has been resolved by PRs https://github.com/twitter/AnomalyDetection/pull/98 and https://github.com/twitter/AnomalyDetection/pull/79

Although these PRs have not been merged into the twitter/AnomalyDetection master branch, but has been merged into isalgueiro/AnomalyDetection. You can try this fork of AnomalyDetection by

devtools::install_github("isalgueiro/AnomalyDetection")
Maryoda2 commented 6 years ago

Well thanks for your answer With isalgueiro it doesn't work, same error. Here is a sample

caijun commented 6 years ago

@Maryoda2 the data you provided is not enough to reproduce the error. Please provide data with at least 20000 rows.

Maryoda2 commented 6 years ago

20.000 is no more a sample, but we can't detect anomalies on dataset less than 20.000 ?

caijun commented 6 years ago

@Maryoda2 Using the data you provided, I encountered the error "Anom detection needs at least 2 periods worth of data". Hence, I need more data to make sure this error was not because of limited data.

Maryoda2 commented 6 years ago

In fact for the moment this is all what I have... So game over then ?

caijun commented 6 years ago

@Maryoda2 this error has been discussed in issues https://github.com/twitter/AnomalyDetection/issues/15 and https://github.com/twitter/AnomalyDetection/issues/42, perhaps they can provide some useful information for you to fix this error by yourself, e.g. fork this repository and modify the source code.