joshuaulrich / xts

Extensible time series class that provides uniform handling of many R time series classes by extending zoo.
http://joshuaulrich.github.io/xts/
GNU General Public License v2.0
219 stars 71 forks source link

merge.xts drops column on join with an xts with no rows #222

Closed ethanbsmith closed 1 year ago

ethanbsmith commented 6 years ago

Description

a join should always produce the union of column form the joined tables. The various types of joins alter the rows produced. specifically, a left join should always output the number of rows in the left table, and the number of columns form both the left and right tables

merge.xts does not produce the correct output columns when one of the input xts object contains no rows

I tried to find the problem, but it looks like the bulk of the code is implemented in C and beyond my abilities. let me know if i can help

Expected behavior

Expected output is that the the columns from the empty table are included, with the value specified by the fill parameter

Minimal, reproducible example

x <- xts(x = matrix(1:9, 3, 3), order.by = as.Date(17167:17169))
colnames(x) <- c("a","b","c")
### this join works correctly
y <- xts(coredata(x), order.by = index(x)+5)
merge(x,y,join="left")

### these join on the empty table produce the correct rows, but do not include columns from y
y <- x[1 == 2]
merge(x, y, join = "left")
merge(y,x,  join = "right")
merge(y,x)
merge(x,y)

Session Info

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] XML_3.98-1.9         jsonlite_1.4         curl_2.6             data.table_1.10.5    quantmod_0.4-12      TTR_0.23-2           xts_0.10-0           zoo_1.8-0            RODBC_1.3-15         doParallel_1.0.10   
[11] iterators_1.0.8      foreach_1.4.4        plotrix_3.6-6        RevoUtilsMath_10.0.0 RevoUtils_10.0.5     RevoMods_11.0.0      MicrosoftML_1.5.0    mrsdeploy_1.1.2      RevoScaleR_9.2.1     lattice_0.20-35     
[21] rpart_4.1-11         checkpoint_0.4.0    

loaded via a namespace (and not attached):
[1] codetools_0.2-15       CompatibilityAPI_1.1.0 grid_3.4.1             R6_2.2.0               tools_3.4.1            compiler_3.4.1         rtvs_1.0.0.0           mrupdate_1.0.1     
joshuaulrich commented 6 years ago

Thanks for the report! I would also note that the current behavior differs from merge.zoo(). Is zoo's behavior what you expect?

x <- zoo(x = matrix(1:9, 3, 3), order.by = as.Date(17167:17169))
colnames(x) <- c("a","b","c")
y <- zoo(coredata(x), order.by = index(x)+5)
z <- y[FALSE]

merge(x, z, all = c(TRUE, FALSE))
           a.x b.x c.x a.z b.z c.z
2017-01-01   1   4   7  NA  NA  NA
2017-01-02   2   5   8  NA  NA  NA
2017-01-03   3   6   9  NA  NA  NA

merge(z, x, all = c(FALSE, TRUE))
           a.z b.z c.z a.x b.x c.x
2017-01-01  NA  NA  NA   1   4   7
2017-01-02  NA  NA  NA   2   5   8
2017-01-03  NA  NA  NA   3   6   9

merge(z, x)
           a.z b.z c.z a.x b.x c.x
2017-01-01  NA  NA  NA   1   4   7
2017-01-02  NA  NA  NA   2   5   8
2017-01-03  NA  NA  NA   3   6   9

merge(x, z)
           a.x b.x c.x a.z b.z c.z
2017-01-01   1   4   7  NA  NA  NA
2017-01-02   2   5   8  NA  NA  NA
2017-01-03   3   6   9  NA  NA  NA
ethanbsmith commented 6 years ago

yes, the zoo merge function produces the expected output shape (number of rows and columns with NA padding). however, i think xts uses different rules for handling duplicate column names.

ethanbsmith commented 2 years ago

edge case test for all inputs having zero rows

x <- xts(x = matrix(1:9, 3, 3), order.by = as.Date(17167:17169))
colnames(x) <- c("a","b","c")
merge(x[F,], x[F,]) #expect [0,6] output
merge(x[F,], x[F,], x[F,]) #expect [0,9] output

seg faults

x <- xts(x = matrix(1:9, 3, 3), order.by = as.Date(17167:17169))
merge(x[F,], x, x[F,]) #faults if run multiple times
joshuaulrich commented 2 years ago

Your first example matches the output from merge.zoo() (except merge.zoo() converts the result from integer to numeric).

x <- xts(matrix(1:9, 3, 3), as.Date(17167:17169), dimnames = list(NULL, c("a","b","c")))
(x0 <- x[FALSE,])
##  a b c

z <- as.zoo(x)
(z0 <- z[FALSE,])
##  a b c

merge(z0, z0)
## Data:
## numeric(0)
## 
## Index:
## Date of length 0

merge(x0, x0)
## Data:
## integer(0)
## 
## Index:
## Date of length 0

@jaryan, @ggrothendieck, @zeileis, what do you think about this edge case where we merge two zoo objects with columns but zero observations? The result is an empty zoo object, but Ethan expects an object with zero observations and 6 columns.

Merging this type of zoo object with at least one zoo object that has observations returns a zoo object with the same number of columns as the sum of the columns of the inputs. So I can understand Ethan's expectation.

Thoughts?


Your second example errors for me, but it doesn't crash the R session even under valgrind and gctorture(TRUE). What's your sessionInfo()?

EDIT: Of course, it crashed right after I posted this... never mind.

merge(x0, x0, x0) 
## Error in merge.xts(x0, x0, x0) : 
##   INTEGER() can only be applied to a 'integer', not a 'double'

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
## [1] tinytest_1.3.1 xts_0.12.2.2   zoo_1.8-11    
## 
## loaded via a namespace (and not attached):
## [1] compiler_4.2.1  parallel_4.2.1  tools_4.2.1     bspm_0.3.10     grid_4.2.1      lattice_0.20-45
ethanbsmith commented 2 years ago

some color on my perspective:

  1. it conforms to relational algebra, which is how i think about R matrix merging/subsetting
  2. it requires less exception handling, as the empty output can still pass through subsequent subsetting/transform code. similar to the idea behind https://github.com/joshuaulrich/TTR/issues/68
ethanbsmith commented 2 years ago

testing on https://github.com/joshuaulrich/xts/commit/a3e9fb2b56bb1e2cfcc9067d845b064a1fe99276

x <- xts(matrix(1:9, 3, 3), as.Date(17167:17169), dimnames = list(NULL, c("a","b","c")))
merge(x, 1:3) #works fine
merge(x, x, 1:3) #errors

rest looks great!

joshuaulrich commented 2 years ago

Thanks for the testing and feedback again! These are fixed now.

I added the examples in your previous comment to the unit tests, and I'll add these too. Can you share more of the tests you're running? It would be great if you could add them to inst/tinytest/test-merge.R.

ethanbsmith commented 2 years ago

those issues all came up running through my analytics suite. will add tests if i think of anything else

current patch works for all my scenarios! Thank you!

joshuaulrich commented 2 years ago

Awesome news! Thanks for testing!