c() should flatten an xts?

shabbychef commented 4 years ago

For an `xts` object, `c(object)` behaves differently

I found a future bug related to checking for non-scalar logicals. This may be an error in xts or in stats, or some mismatch in between. The following throws an error:

Sys.setenv('_R_CHECK_LENGTH_1_CONDITION_'='true')
Sys.setenv('_R_CHECK_LENGTH_1_LOGIC2_'='true')
library(xts)
set.seed(1234)
n <- 20
X <- xts(rnorm(n),order.by=Sys.Date() + seq_len(n))
XX <- cbind(X,X^2)
mod <- lm(XX ~ 1)
summary.lm(mod)

I have tracked down the problem in summary.lm as follows:

z <- object
# ...
f <- z$fitted.values
# ...
if (is.finite(resvar) && resvar < (mean(f)^2 + var(c(f))) * 1e-30) 
    warning("essentially perfect fit: summary may be unreliable")

The fitted values will be an xts object, and the call to c(f) results in that xts object back again; computing the variance then gives a 2x2 matrix instead of a scalar and the scalar logical check can throw an error. I should note that you do not get an error if you cast the xts to matrix first, this runs fine:

YY <- as.matrix(XX)
mod2 <- lm(YY ~ 1)
summary.lm(mod2)

In this case c(f) returns an array, the variance of which is a scalar.

Expected behavior

Not throw an error in summary.lm. Perhaps behave like a matrix with respect to c(), or provide a convincing case to stats package maintainers to fix their code.

Minimal, reproducible example

Sys.setenv('_R_CHECK_LENGTH_1_CONDITION_'='true')
Sys.setenv('_R_CHECK_LENGTH_1_LOGIC2_'='true')
library(xts)
set.seed(1234)
n <- 20
X <- xts(rnorm(n),order.by=Sys.Date() + seq_len(n))
XX <- cbind(X,X^2)
mod <- lm(XX ~ 1)
summary.lm(mod)

Session Info

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xts_0.11-2     zoo_1.8-6      colorout_1.1-2 fortunes_1.5-4 devtools_2.2.1 usethis_1.5.1  drat_0.1.5    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3        magrittr_1.5      pkgload_1.0.2     lattice_0.20-38   R6_2.4.1          rlang_0.4.1       tools_3.6.2       grid_3.6.2       
 [9] pkgbuild_1.0.6    sessioninfo_1.1.1 cli_1.1.0         withr_2.1.2       ellipsis_0.3.0    remotes_2.1.0     assertthat_0.2.1  digest_0.6.23    
[17] rprojroot_1.3-2   crayon_1.3.4      processx_3.4.1    callr_3.3.2       fs_1.3.1          ps_1.3.0          testthat_2.3.0    memoise_1.1.0    
[25] glue_1.3.1        compiler_3.6.2    desc_1.2.0        backports_1.1.5   prettyunits_1.0.2

shabbychef commented 4 years ago

I would propose to stats maintainers to replace var(c(f)) with min(var(c(f)))?

joshuaulrich commented 4 years ago

Thanks for the report. The problem is that summary.lm() is using c() for its side-effect of dropping attributes (including dim).

c.xts() behaves consistently with c.zoo(), which means it just calls rbind(). Changing the behavior of c.xts() would be inconsistent with zoo, and would likely break user code.

The best solution I can think of is to change var(c(f)) to var(as.vector(f)). I'll submit a request on R's bugzilla tracker.

joshuaulrich commented 4 years ago

I did some digging before posting. I think I'll have a hard time convincing R-core to change summary.lm(). That's a method to summary(), and you're calling it on a "mlm" object. Calling summary(mod) works, because it dispatches to stats:::summary.mlm().

I'm going to close this because it's not a bug in xts, and I don't think it's going to affect any change in base R. Please let me know if I've missed something. I'm happy to continue the discussion. We can re-open, if needed.

Sys.setenv('_R_CHECK_LENGTH_1_CONDITION_'='true')
Sys.setenv('_R_CHECK_LENGTH_1_LOGIC2_'='true')
library(xts)
set.seed(1234)
n <- 20
X <- xts(rnorm(n),order.by=Sys.Date() + seq_len(n))
XX <- cbind(X,X^2)
mod <- lm(XX ~ 1)
summary(mod)
# Response X :
# 
# Call:
# lm(formula = X ~ 1)
# 
# Residuals:
#              X
# Min    -2.0950
# 1Q     -0.5997
# Median -0.2782
# 3Q      0.5660
# Max     2.6665
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  -0.2507     0.2267  -1.106    0.283
# 
# Residual standard error: 1.014 on 19 degrees of freedom
# 
# 
# Response X.2 :
# 
# Call:
# lm(formula = X.2 ~ 1)
# 
# Residuals:
#             X.2
# Min    -1.03509
# 1Q     -0.79025
# Median -0.57280
# 3Q     -0.09958
# Max     4.79701
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)  
# (Intercept)   1.0392     0.3656   2.843   0.0104 *
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 1.635 on 19 degrees of freedom
#

shabbychef commented 4 years ago

Thanks for digging into this. The big reveal is that I wasn't calling summary.lm, but rather a method of the sandwich package (I think it was bread) was explicitly calling summary.lm for me. I didn't want to bring sandwich into this if I could avoid doing so.

joshuaulrich / xts