joshuaulrich / xts

Extensible time series class that provides uniform handling of many R time series classes by extending zoo.
http://joshuaulrich.github.io/xts/
GNU General Public License v2.0
220 stars 71 forks source link

Odd merge behavior #284

Closed butterflyology closed 5 years ago

butterflyology commented 5 years ago

Description

I want to merge two xts objects: one a zero-length object and the second the one that contains the data I want to regularize. When I investigate the merged object many of the indices from the zero-length object are gone and values are shifted forward to odd times. =

Expected behavior

I expect the merged object to contain all of the indices of the regularized object and fill in NAs with data if there is a corresponding value, and otherwise leave NAs in place if no datum with the same index exists.

Minimal, reproducible example

library("tidyverse")
library("xts")

PotC <- read_csv("https://raw.githubusercontent.com/butterflyology/TouringPlans_data/master/data/pirates_of_caribbean.csv", col_names = TRUE, col_types = cols(date = col_date(format = "%m/%d/%Y"), datetime = col_datetime(format = "%Y-%m-%d %H:%M:%S"), SPOSTMIN = col_integer()))

glimpse(PotC)
Observations: 237,486
Variables: 3
$ date     <date> 2012-01-01, 2012-01-01, 2012-01-01, 2012-01-0…
$ datetime <dttm> 2012-01-01 10:14:00, 2012-01-01 12:13:00, 201…
$ SPOSTMIN <int> 5, 5, 10, 10, 5, 5, 0, 5, 5, 5, 20, 25, 25, 30…

PotC_xts <- xts(PotC[, 3], order.by = PotC$datetime)
colnames(PotC_xts) <- "TIME"

head(PotC_xts)
                    TIME
2012-01-01 10:14:00    5
2012-01-01 12:13:00    5
2012-01-01 13:50:00   10
2012-01-01 14:42:00   10
2012-01-01 20:40:00    5
2012-01-01 21:40:00    5

PotC_xts <- align.time(PotC_xts, n = 5 * 60) # This shifts to the next nearest even if the observation is at a 5 minute time.

# create a regularized data set every 5 minutes
datetimes <- seq(from = as.POSIXct("2012-01-01 08:00:00"), to = as.POSIXct("2018-06-18 22:00:00"), by = "5 min")

head(datetimes)
[1] "2012-01-01 08:00:00 PST" "2012-01-01 08:05:00 PST"
[3] "2012-01-01 08:10:00 PST" "2012-01-01 08:15:00 PST"
[5] "2012-01-01 08:20:00 PST" "2012-01-01 08:25:00 PST"

tail(datetimes)
[1] "2018-06-18 21:35:00 PDT" "2018-06-18 21:40:00 PDT"
[3] "2018-06-18 21:45:00 PDT" "2018-06-18 21:50:00 PDT"
[5] "2018-06-18 21:55:00 PDT" "2018-06-18 22:00:00 PDT"

str(datetimes)
 POSIXct[1:679837], format: "2012-01-01 08:00:00" "2012-01-01 08:05:00" ...

# create a zero-width xts object
regularized <- xts(, order.by = datetimes)
regularized <- regularized["T08:00/T22:00"] # constrict to "business hours"

regularized[1:6]
                    [,1]
2012-01-01 08:00:00   NA
2012-01-01 08:05:00   NA
2012-01-01 08:10:00   NA
2012-01-01 08:15:00   NA
2012-01-01 08:20:00   NA
2012-01-01 08:25:00   NA

str(regularized)
An ‘xts’ object on 2012-01-01 08:00:00/2018-06-18 22:00:00 containing:
  Data: logi [1:399009, 1] NA NA NA NA NA NA ...
  Indexed by objects of class: [POSIXct,POSIXt] TZ: 
  xts Attributes:  
 NULL

# now merge the data sets
merged <- merge.xts(PotC_xts, regularized)

dim(merged)
[1] 598906      2 # more observations than the regularized object, also two columns

head(merged, n = 10)
                    TIME regularized
2012-01-01 10:15:00    5          NA
2012-01-01 12:15:00    5          NA
2012-01-01 13:55:00   10          NA
2012-01-01 14:45:00   10          NA
2012-01-01 16:00:00   NA          NA
2012-01-01 16:05:00   NA          NA
2012-01-01 16:10:00   NA          NA
2012-01-01 16:15:00   NA          NA
2012-01-01 16:20:00   NA          NA
2012-01-01 16:25:00   NA          NA

Session Info

R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] xts_0.11-2      zoo_1.8-4       forcats_0.3.0  
 [4] stringr_1.3.1   dplyr_0.7.8     purrr_0.2.5    
 [7] readr_1.3.1     tidyr_0.8.2     tibble_2.0.0   
[10] ggplot2_3.1.0   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       cellranger_1.1.0 pillar_1.3.1    
 [4] compiler_3.5.2   plyr_1.8.4       bindr_0.1.1     
 [7] tools_3.5.2      jsonlite_1.6     lubridate_1.7.4 
[10] nlme_3.1-137     gtable_0.2.0     lattice_0.20-38 
[13] pkgconfig_2.0.2  rlang_0.3.0.1    cli_1.0.1       
[16] rstudioapi_0.8   curl_3.2         yaml_2.2.0      
[19] haven_2.0.0      bindrcpp_0.2.2   withr_2.1.2     
[22] xml2_1.2.0       httr_1.4.0       generics_0.0.2  
[25] hms_0.4.2        grid_3.5.2       tidyselect_0.2.5
[28] glue_1.3.0       R6_2.3.0         fansi_0.4.0     
[31] readxl_1.2.0     modelr_0.1.2     magrittr_1.5    
[34] backports_1.1.3  scales_1.0.0     rvest_0.3.2     
[37] assertthat_0.2.0 colorspace_1.3-2 utf8_1.1.4      
[40] stringi_1.2.4    lazyeval_0.2.1   munsell_0.5.0   
[43] broom_0.5.1      crayon_1.3.4    
joshuaulrich commented 5 years ago

Can you please provide a minimal example? I'm not likely to install 40+ packages in order to run your example. I'm also curious whether the issue exists when all those other packages aren't loaded and/or attached.

It looks like there may be an issue with the time-of-day subsetting, since regularlized shouldn't have any columns.

Also, it's bad practice to call methods (e.g. merge.xts()) explicitly. There's no guarantee of what it will do if you call merge.xts() with non-xts objects.

butterflyology commented 5 years ago

Thanks for the reply. You will note that I loaded two packages, tidyverse (which is admittedly rather large and cumbersome) and xts. I guess I thought that two packages was closer to minimal than loading my entire workflow but I can report that if you just load readr the same result exists.

I'll adjust the merge.xts code to reflect best practices. Thanks again.

joshuaulrich commented 5 years ago

tidyverse is a meta-package that loads 40+ other packages. xts only loads itself and zoo.

harvey131 commented 5 years ago

I think you might be trying to merge the xts with just the index of regularized. Also, try changing the join argument to outer/inner/left/right depending on what you want.

dim(merge.xts(PotC_xts, index(regularized)))
dim(merge.xts(PotC_xts, index(regularized), join='inner'))
dim(merge.xts(PotC_xts, index(regularized), join='right'))
butterflyology commented 5 years ago

Thank you, @harvey131. If I understand you correctly I still get an error. When I enter the following code:

merged <- merge(PotC_xts, index(regularized), join = "right")

I do get the correct dimensions 399009 1 but there is a time shift (the first index in the xts object is shifted to 16:00:00 hours and the data are all filled in with NAs. The shift in time happens regardless if I "trim" the indices to only operating hours.

harvey131 commented 5 years ago

If you use join = 'right' it will produce an NA if the time doesnt exist in the 'x' argument PotC_xts. Are you trying to get the prevailing value if the 'y' time doesn't exist in 'x'?

As a separate issue, maybe merge.xts should have documentation added that 'y' can be a vector of POSIXct.

> merged <- merge(PotC_xts, index(regularized), join = "right")
> head(merged, n = 15)
                    TIME
2012-01-01 13:00:00   NA
2012-01-01 13:05:00   NA
2012-01-01 13:10:00   NA
2012-01-01 13:15:00   NA
2012-01-01 13:20:00   NA
2012-01-01 13:25:00   NA
2012-01-01 13:30:00   NA
2012-01-01 13:35:00   NA
2012-01-01 13:40:00   NA
2012-01-01 13:45:00   NA
2012-01-01 13:50:00   NA
2012-01-01 13:55:00   10
2012-01-01 14:00:00   NA
2012-01-01 14:05:00   NA
2012-01-01 14:10:00   NA
# in this example the timestamps are printed as GMT
> format(index(merged)[1], tz='GMT')
[1] "2012-01-01 13:00:00"
butterflyology commented 5 years ago

Thanks @harvey131.

I think that I am doing a poor job of explaining, thank you for your patience.

Your example looks spot on and just what I want, but I am not getting the same result. I use all of the code in the example above and then change my merge to match yours:

merged <- merge(PotC_xts, index(regularized), join = "right")
head(merged, n = 15)
                    TIME
2012-01-01 16:00:00   NA
2012-01-01 16:05:00   NA
2012-01-01 16:10:00   NA
2012-01-01 16:15:00   NA
2012-01-01 16:20:00   NA
2012-01-01 16:25:00   NA
2012-01-01 16:30:00   NA
2012-01-01 16:35:00   NA
2012-01-01 16:40:00   NA
2012-01-01 16:45:00   NA
2012-01-01 16:50:00   NA
2012-01-01 16:55:00   NA
2012-01-01 17:00:00   NA
2012-01-01 17:05:00   NA
2012-01-01 17:10:00   NA
Warning message:
timezone of object (UTC) is different than current timezone (). 
joshuaulrich commented 5 years ago

@butterflyology I think your different result is because the index of PotC_xts and/or regularized has a UTC timezone in your last comment. No timezone was specified in the examples in your other comments, so they were using your local time (US/Pacific).

Maybe your subsequent runs added a non-local timezone to one of the objects?

For what it's worth, here's my minimal reproducible example:

library("xts")
PotC <- read.csv("https://raw.githubusercontent.com/butterflyology/TouringPlans_data/master/data/pirates_of_caribbean.csv", as.is = TRUE)

PotC_xts <- xts(PotC$SPOSTMIN, as.POSIXct(PotC$datetime))
colnames(PotC_xts) <- "TIME"
PotC_xts <- align.time(PotC_xts, n = 5 * 60)

# create a zero-width xts object
regularized <- xts(, seq(from = as.POSIXct("2012-01-01 08:00:00"),
                         to = as.POSIXct("2018-06-18 22:00:00"), by = "5 min"))
regularized <- regularized["T08:00/T22:00"] # constrict to "business hours"
merged <- merge(PotC_xts, regularized)

@harvey131 Thanks for your help with this confusion! I think your suggestion to use join = "right" is correct. Regarding y being POSIXct, I would prefer that behavior remain undocumented. I don't particularly like it, since it's easy to construct an empty xts object instead.

butterflyology commented 5 years ago

Thanks you, @joshuaulrich and @harvey131.

I'm think that the lubridate or something else in the tidyverse is introducing an error into the mix. It seems to assume all dates/times to be UTC regardless of what we tell xts.

joshuaulrich commented 5 years ago

@butterflyology, do you think it's safe to say this isn't a bug in xts? Please note that I've investigated the behavior that subsetting a zero-width xts object returns an object with a column of NA and found that it is consistent with zoo... and therefore not a bug.

butterflyology commented 5 years ago

@joshuaulrich I think that you are correct. Thanks for your assistance.

joshuaulrich commented 5 years ago

Thanks @butterflyology, I appreciate that you followed-up!