augment.lm warning message discrepancy when newdata has/doesn't have outcome variable

rudeboybert commented 4 years ago

Hello, I'm noticing a discrepancy in warning messages for augment.lm() applied to a y ~ x formula when the newdata data frame:

has the y outcome variable: no warning message
doesn't have the outcome variable y: Warning: 'newdata' had 3 rows but variables found have 234 rows.

I'm not sure if this was intentional or not. IMO there shouldn't be a warning in the latter case since this is a realistic situation where we want predicted values.

library(tidyverse)
library(broom)

# Prep mtcars
mtcars <- mtcars %>%
  rownames_to_column(var = "automobile") %>%
  mutate(cyl = as.factor(cyl)) %>% 
  select(mpg, hp, cyl)

# Fit lm model with y = mpg
mpg_model <- lm(mpg ~ hp + cyl, data = mtcars)

# Use newdata. Output has .resid variable
mtcars %>% 
  slice(1:3) %>% 
  augment(mpg_model, newdata = .)
#> # A tibble: 3 x 5
#>     mpg    hp cyl   .fitted .resid
#>   <dbl> <dbl> <fct>   <dbl>  <dbl>
#> 1  21     110 6        20.0  0.962
#> 2  21     110 6        20.0  0.962
#> 3  22.8    93 4        26.4 -3.61

# Use newdata without y. Output does not have .resid variable as expected, but
# NOTE warning message here:
mtcars %>% 
  slice(1:3) %>%
  select(-mpg) %>% 
  augment(mpg_model, newdata = .)
#> Warning: 'newdata' had 3 rows but variables found have 234 rows
#> # A tibble: 3 x 3
#>      hp cyl   .fitted
#>   <dbl> <fct>   <dbl>
#> 1   110 6        20.0
#> 2   110 6        20.0
#> 3    93 4        26.4

^{Created on 2020-07-13 by the reprex package (v0.3.0)}

Here is my sessionInfo()

R version 4.0.1 (2020-06-06)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] broom_0.7.0.9001 forcats_0.5.0    stringr_1.4.0    dplyr_1.0.0      purrr_0.3.4      readr_1.3.1      tidyr_1.1.0      tibble_3.0.3     ggplot2_3.3.2    tidyverse_1.3.0 

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0 xfun_0.15        haven_2.3.1      colorspace_1.4-1 vctrs_0.3.1      generics_0.0.2   htmltools_0.5.0  yaml_2.2.1       blob_1.2.1       rlang_0.4.7      pillar_1.4.6     withr_2.2.0      glue_1.4.1      
[14] DBI_1.1.0        dbplyr_1.4.4     modelr_0.1.8     readxl_1.3.1     lifecycle_0.2.0  munsell_0.5.0    gtable_0.3.0     cellranger_1.1.0 rvest_0.3.5      evaluate_0.14    knitr_1.29       callr_3.4.3      ps_1.3.3        
[27] fansi_0.4.1      Rcpp_1.0.5       clipr_0.7.0      backports_1.1.8  scales_1.1.1     jsonlite_1.7.0   fs_1.4.2         hms_0.5.3        digest_0.6.25    stringi_1.4.6    processx_3.4.3   grid_4.0.1       cli_2.0.2       
[40] tools_4.0.1      magrittr_1.5     crayon_1.3.4     whisker_0.4      pkgconfig_2.0.3  ellipsis_0.3.1   xml2_1.3.2       reprex_0.3.0     lubridate_1.7.9  assertthat_0.2.1 rmarkdown_2.3    httr_1.4.1       rstudioapi_0.11 
[53] R6_2.4.1         compiler_4.0.1

alexpghayes commented 4 years ago

Yeah this definitely shouldn't be happening.

On Mon, Jul 13, 2020, 9:54 AM Albert Y. Kim notifications@github.com wrote:

Hello, I'm noticing a discrepancy in warning messages for augment.lm() applied to a y ~ x formula when the newdata data frame:

has the y outcome variable: no warning message

doesn't have the outcome variable y: Warning: 'newdata' had 3 rows but variables found have 234 rows.

I'm not sure if this was intentional or not. IMO there shouldn't be a warning in the latter case since this is a realistic situation where we want predicted values.

library(tidyverse) library(broom)

Prep mtcarsmtcars <- mtcars %>%

rownames_to_column(var = "automobile") %>% mutate(cyl = as.factor(cyl)) %>% select(mpg, hp, cyl)

Fit lm model with y = mpgmpg_model <- lm(mpg ~ hp + cyl, data = mtcars)

Use newdata. Output has .resid variablemtcars %>%

slice(1:3) %>% augment(mpg_model, newdata = .)#> # A tibble: 3 x 5#> mpg hp cyl .fitted .resid#> #> 1 21 110 6 20.0 0.962#> 2 21 110 6 20.0 0.962#> 3 22.8 93 4 26.4 -3.61

Use newdata without y. Output does not have .resid variable as expected, but# NOTE warning message here:mtcars %>%

slice(1:3) %>% select(-mpg) %>% augment(mpg_model, newdata = .)#> Warning: 'newdata' had 3 rows but variables found have 234 rows#> # A tibble: 3 x 3#> hp cyl .fitted#> #> 1 110 6 20.0#> 2 110 6 20.0#> 3 93 4 26.4

Created on 2020-07-13 by the reprex package https://reprex.tidyverse.org (v0.3.0)

Here is my sessionInfo()

R version 4.0.1 (2020-06-06) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Catalina 10.15.5

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] broom_0.7.0.9001 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.0 purrr_0.3.4 readr_1.3.1 tidyr_1.1.0 tibble_3.0.3 ggplot2_3.3.2 tidyverse_1.3.0

loaded via a namespace (and not attached): [1] tidyselect_1.1.0 xfun_0.15 haven_2.3.1 colorspace_1.4-1 vctrs_0.3.1 generics_0.0.2 htmltools_0.5.0 yaml_2.2.1 blob_1.2.1 rlang_0.4.7 pillar_1.4.6 withr_2.2.0 glue_1.4.1 [14] DBI_1.1.0 dbplyr_1.4.4 modelr_0.1.8 readxl_1.3.1 lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0 rvest_0.3.5 evaluate_0.14 knitr_1.29 callr_3.4.3 ps_1.3.3 [27] fansi_0.4.1 Rcpp_1.0.5 clipr_0.7.0 backports_1.1.8 scales_1.1.1 jsonlite_1.7.0 fs_1.4.2 hms_0.5.3 digest_0.6.25 stringi_1.4.6 processx_3.4.3 grid_4.0.1 cli_2.0.2 [40] tools_4.0.1 magrittr_1.5 crayon_1.3.4 whisker_0.4 pkgconfig_2.0.3 ellipsis_0.3.1 xml2_1.3.2 reprex_0.3.0 lubridate_1.7.9 assertthat_0.2.1 rmarkdown_2.3 httr_1.4.1 rstudioapi_0.11 [53] R6_2.4.1 compiler_4.0.1

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/tidymodels/broom/issues/896, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADTBG2YOHRKJYXFSQKTRIRDR3MN25ANCNFSM4OYSAU7A .

rudeboybert commented 4 years ago

In ?augment.lm you state:

Users may pass data to augment via either the data argument or the newdata argument. If the user passes data to the data argument, it must be exactly the data that was used to fit the model object. Pass datasets to newdata to augment data that was not used during model fitting. This still requires that all columns used to fit the model are present.

So it seems this is intentional behavior.

From a statistical perspective however, IMO that last sentence should be less restrictive: This still requires that all predictor / explanatory / independent variable columns used to fit the model are present.

rudeboybert commented 4 years ago

Would you be open to a pull request to try to address this?

simonpcouch commented 4 years ago

Thanks for bringing this up, @rudeboybert!

Would be very open to a PR, yes! Looking through the blames, looks like this was intentional at one point, but is no longer consistent with behavior from the rest of tidymodels.

alexpghayes commented 4 years ago

Yeah this is missing a delete.response() call or something like that somewhere. The logic should be to check for all predictors only. An update to the doc as well would be great.

On Mon, Jul 13, 2020, 10:57 AM Simon P. Couch notifications@github.com wrote:

Thanks for bringing this up, @rudeboybert https://github.com/rudeboybert !

Would be very open to a PR, yes! Looking through the blames, looks like this was intentional at one point, but is no longer consistent with behavior from the rest of tidymodels.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/tidymodels/broom/issues/896#issuecomment-657643953, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADTBG24U734S5HESG5RK2VDR3MVHNANCNFSM4OYSAU7A .

rudeboybert commented 4 years ago

Giving credit where credit is due: I should note that it was my student @mariumtapal who narrowed down this source of this error!

simonpcouch commented 4 years ago

Thanks for catching this, @mariumtapal! :-)

alexpghayes commented 4 years ago

I will take a look and change the fix to this, but struggled with an R 4.0.0 upgrade and reproducing this. Will eventually get to this.

github-actions[bot] commented 3 years ago

This issue has been automatically closed due to inactivity.

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

tidymodels / broom

augment.lm warning message discrepancy when newdata has/doesn't have outcome variable #896

Prep mtcarsmtcars <- mtcars %>%

Fit lm model with y = mpgmpg_model <- lm(mpg ~ hp + cyl, data = mtcars)

Use newdata. Output has .resid variablemtcars %>%

Use newdata without y. Output does not have .resid variable as expected, but# NOTE warning message here:mtcars %>%