Closed rudeboybert closed 3 years ago
Yeah this definitely shouldn't be happening.
On Mon, Jul 13, 2020, 9:54 AM Albert Y. Kim notifications@github.com wrote:
Hello, I'm noticing a discrepancy in warning messages for augment.lm() applied to a y ~ x formula when the newdata data frame:
- has the y outcome variable: no warning message
- doesn't have the outcome variable y: Warning: 'newdata' had 3 rows but variables found have 234 rows.
I'm not sure if this was intentional or not. IMO there shouldn't be a warning in the latter case since this is a realistic situation where we want predicted values.
library(tidyverse) library(broom)
Prep mtcarsmtcars <- mtcars %>%
rownames_to_column(var = "automobile") %>% mutate(cyl = as.factor(cyl)) %>% select(mpg, hp, cyl)
Fit lm model with y = mpgmpg_model <- lm(mpg ~ hp + cyl, data = mtcars)
Use newdata. Output has .resid variablemtcars %>%
slice(1:3) %>% augment(mpg_model, newdata = .)#> # A tibble: 3 x 5#> mpg hp cyl .fitted .resid#>
#> 1 21 110 6 20.0 0.962#> 2 21 110 6 20.0 0.962#> 3 22.8 93 4 26.4 -3.61 Use newdata without y. Output does not have .resid variable as expected, but# NOTE warning message here:mtcars %>%
slice(1:3) %>% select(-mpg) %>% augment(mpg_model, newdata = .)#> Warning: 'newdata' had 3 rows but variables found have 234 rows#> # A tibble: 3 x 3#> hp cyl .fitted#>
#> 1 110 6 20.0#> 2 110 6 20.0#> 3 93 4 26.4 Created on 2020-07-13 by the reprex package https://reprex.tidyverse.org (v0.3.0)
Here is my sessionInfo()
R version 4.0.1 (2020-06-06) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Catalina 10.15.5
Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] broom_0.7.0.9001 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.0 purrr_0.3.4 readr_1.3.1 tidyr_1.1.0 tibble_3.0.3 ggplot2_3.3.2 tidyverse_1.3.0
loaded via a namespace (and not attached): [1] tidyselect_1.1.0 xfun_0.15 haven_2.3.1 colorspace_1.4-1 vctrs_0.3.1 generics_0.0.2 htmltools_0.5.0 yaml_2.2.1 blob_1.2.1 rlang_0.4.7 pillar_1.4.6 withr_2.2.0 glue_1.4.1 [14] DBI_1.1.0 dbplyr_1.4.4 modelr_0.1.8 readxl_1.3.1 lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0 rvest_0.3.5 evaluate_0.14 knitr_1.29 callr_3.4.3 ps_1.3.3 [27] fansi_0.4.1 Rcpp_1.0.5 clipr_0.7.0 backports_1.1.8 scales_1.1.1 jsonlite_1.7.0 fs_1.4.2 hms_0.5.3 digest_0.6.25 stringi_1.4.6 processx_3.4.3 grid_4.0.1 cli_2.0.2 [40] tools_4.0.1 magrittr_1.5 crayon_1.3.4 whisker_0.4 pkgconfig_2.0.3 ellipsis_0.3.1 xml2_1.3.2 reprex_0.3.0 lubridate_1.7.9 assertthat_0.2.1 rmarkdown_2.3 httr_1.4.1 rstudioapi_0.11 [53] R6_2.4.1 compiler_4.0.1
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/tidymodels/broom/issues/896, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADTBG2YOHRKJYXFSQKTRIRDR3MN25ANCNFSM4OYSAU7A .
In ?augment.lm
you state:
Users may pass data to augment via either the data argument or the newdata argument. If the user passes data to the data argument, it must be exactly the data that was used to fit the model object. Pass datasets to newdata to augment data that was not used during model fitting. This still requires that all columns used to fit the model are present.
So it seems this is intentional behavior.
From a statistical perspective however, IMO that last sentence should be less restrictive: This still requires that all predictor / explanatory / independent variable columns used to fit the model are present.
Would you be open to a pull request to try to address this?
Thanks for bringing this up, @rudeboybert!
Would be very open to a PR, yes! Looking through the blames, looks like this was intentional at one point, but is no longer consistent with behavior from the rest of tidymodels.
Yeah this is missing a delete.response() call or something like that somewhere. The logic should be to check for all predictors only. An update to the doc as well would be great.
On Mon, Jul 13, 2020, 10:57 AM Simon P. Couch notifications@github.com wrote:
Thanks for bringing this up, @rudeboybert https://github.com/rudeboybert !
Would be very open to a PR, yes! Looking through the blames, looks like this was intentional at one point, but is no longer consistent with behavior from the rest of tidymodels.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/tidymodels/broom/issues/896#issuecomment-657643953, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADTBG24U734S5HESG5RK2VDR3MVHNANCNFSM4OYSAU7A .
Giving credit where credit is due: I should note that it was my student @mariumtapal who narrowed down this source of this error!
Thanks for catching this, @mariumtapal! :-)
I will take a look and change the fix to this, but struggled with an R 4.0.0 upgrade and reproducing this. Will eventually get to this.
This issue has been automatically closed due to inactivity.
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
Hello, I'm noticing a discrepancy in warning messages for
augment.lm()
applied to ay ~ x
formula when thenewdata
data frame:y
outcome variable: no warning messagey
:Warning: 'newdata' had 3 rows but variables found have 234 rows
.I'm not sure if this was intentional or not. IMO there shouldn't be a warning in the latter case since this is a realistic situation where we want predicted values.
Created on 2020-07-13 by the reprex package (v0.3.0)
Here is my
sessionInfo()