tidymodels / broom

Convert statistical analysis objects from R into tidy format
https://broom.tidymodels.org

augment error with `na.action = na.exclude` in `lm` #1187

Open wbvguo opened 10 months ago

wbvguo commented 10 months ago

Dear broom maintainer,

the problem

I was running lm on a dataset with NA values and found that augment doesn't work with na.action = na.exclude.

code

df <- data.frame(
  id = 1:10,
  x = rnorm(10),
  y = rnorm(10)
)

df$x[5] = NA

broom::augment(lm(y~x, data = df, na.action = na.exclude))

output

> Error in `$<-`:
! Assigned data `predict(x, na.action = na.pass, ...) %>% unname()` must be compatible with existing data.
✖ Existing data has 9 rows.
✖ Assigned data has 10 rows.
ℹ Only vectors of size 1 are recycled.
Caused by error in `vectbl_recycle_rhs_rows()`:
! Can't recycle input of size 10 to size 9.
Run `rlang::last_trace()` to see where the error occurred.

Removing the na.action = na.exclude option makes it work. In fact, the following z1 and z2

z1 = lm(y~x, data = df, na.action = na.exclude)
z2 = lm(y~x, data = df) # the default na.action is na.omit

have the same model, coefficients, and residuals components, which makes me wonder how na.exclude and na.omit influence augment's behavior.
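
For anyone comparing the two fits, here is a minimal base R sketch (no broom needed; the seed is added only for reproducibility and is not part of the original report). It suggests where the fits differ: the stored components agree, but the extractor functions pad the na.exclude fit back to the original number of rows, which appears to be the source of the 10-versus-9 mismatch in the error above.

```r
set.seed(1)  # assumption: seed added for reproducibility
df <- data.frame(id = 1:10, x = rnorm(10), y = rnorm(10))
df$x[5] <- NA

z1 <- lm(y ~ x, data = df, na.action = na.exclude)
z2 <- lm(y ~ x, data = df)  # default na.action is na.omit

# The stored components are the same in both fits
all.equal(coef(z1), coef(z2))            # TRUE
all.equal(z1$residuals, z2$residuals)    # TRUE

# The only stored difference is the class of the na.action component
class(z1$na.action)                      # "exclude"
class(z2$na.action)                      # "omit"

# Extractor functions pad the na.exclude fit with NA for the dropped row
length(residuals(z1))                    # 10, with NA in position 5
length(residuals(z2))                    # 9
length(predict(z1))                      # 10
nrow(model.frame(z1))                    # 9
```

So the model frame has 9 rows while predict() returns 10 values, which matches the sizes reported in the error message.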

I'm not entirely sure whether the issue above originates from the augment function or the lm function. I would greatly appreciate any insights or guidance you could offer on this matter. Thank you in advance for your assistance.

Thanks!

sessioninfo

> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] purrr_1.0.2   broom_1.0.5   tidyr_1.3.0   dplyr_1.1.3   furrr_0.3.1   future_1.33.0

loaded via a namespace (and not attached):
 [1] parallelly_1.36.0 rstudioapi_0.15.0 knitr_1.44        magrittr_2.0.3    tidyselect_1.2.0  R6_2.5.1          rlang_1.1.1       fansi_1.0.5      
 [9] globals_0.16.2    tools_4.2.3       parallel_4.2.3    xfun_0.40         utf8_1.2.4        cli_3.6.1         digest_0.6.33     tibble_3.2.1     
[17] lifecycle_1.0.3   vctrs_0.6.4       codetools_0.2-19  glue_1.6.2        compiler_4.2.3    pillar_1.9.0      generics_0.1.3    backports_1.4.1  
[25] listenv_0.9.0     pkgconfig_2.0.3

simonpcouch commented 10 months ago

Thanks for the issue, @wbvguo!

You may find the documentation helpful here:

When the modeling was performed with na.action = "na.exclude", one should provide the original data as a second argument, at which point the augmented data will contain those rows (typically with NAs in place of the new columns).

As in:

library(broom)

df <- data.frame(
  id = 1:10,
  x = rnorm(10),
  y = rnorm(10)
)

df$x[5] = NA

m <- lm(y~x, data = df, na.action = na.exclude)
augment(m, df)
#> # A tibble: 10 × 9
#>       id      x      y .fitted .resid  .hat .sigma  .cooksd .std.resid
#>    <int>  <dbl>  <dbl>   <dbl>  <dbl> <dbl>  <dbl>    <dbl>      <dbl>
#>  1     1  0.593  0.278 -0.480   0.758 0.269  0.629  0.321        1.32 
#>  2     2  0.185 -0.639 -0.316  -0.323 0.168  0.712  0.0281      -0.527
#>  3     3 -0.830 -0.224  0.0929 -0.316 0.135  0.713  0.0200      -0.506
#>  4     4 -1.86   1.40   0.509   0.896 0.421  0.545  1.11         1.75 
#>  5     5 NA     -0.474 NA      NA     0      0.672 NA           NA    
#>  6     6 -0.687  0.583  0.0355  0.547 0.121  0.686  0.0519       0.868
#>  7     7 -0.680 -0.206  0.0324 -0.238 0.120  0.719  0.00977     -0.378
#>  8     8 -1.53  -0.596  0.375  -0.971 0.294  0.552  0.614       -1.72 
#>  9     9  0.605 -0.993 -0.485  -0.509 0.273  0.684  0.148       -0.887
#> 10    10  0.332 -0.219 -0.375   0.156 0.199  0.723  0.00834      0.259

Created on 2024-01-22 with reprex v2.1.0

Looks like we fail to raise an informative warning here, as is documented. Will make a note to look into this. :)
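
Until such a warning exists, one possible user-side guard is sketched below. The helper name uses_na_exclude is hypothetical and not part of broom; it only inspects the na.action component that lm stores on the fitted object. The seed is likewise an assumption added for reproducibility.

```r
# Hypothetical helper (not part of broom): TRUE when the model was fitted
# with na.action = na.exclude and at least one row was actually dropped.
# If no rows were dropped, model$na.action is NULL and this returns FALSE.
uses_na_exclude <- function(model) {
  inherits(model$na.action, "exclude")
}

set.seed(1)  # assumption: seed added for reproducibility
df <- data.frame(id = 1:10, x = rnorm(10), y = rnorm(10))
df$x[5] <- NA

m_exclude <- lm(y ~ x, data = df, na.action = na.exclude)
m_omit    <- lm(y ~ x, data = df, na.action = na.omit)

uses_na_exclude(m_exclude)       # TRUE
uses_na_exclude(m_omit)          # FALSE

# Index of the row that was dropped during fitting
as.integer(m_exclude$na.action)  # 5

# With broom installed, the check can choose the call that works:
# if (uses_na_exclude(m_exclude)) {
#   broom::augment(m_exclude, data = df)
# } else {
#   broom::augment(m_exclude)
# }
```

This keeps the check outside of broom itself, so it is only a workaround for the situation in this issue and not a description of how augment behaves internally.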