tidymodels / hardhat

Construct Modeling Packages
https://hardhat.tidymodels.org

`mold()` inconsistently preserves (with XY method) or ignores (with formula method) non-base vector classes #219

Closed mikemahoney218 closed 1 year ago

mikemahoney218 commented 1 year ago

The problem

It seems like the formula method for `mold()` ignores non-base vector classes (returning output of a different class), while the other `mold()` methods preserve them. I believe this is inherited from `stats::model.frame()`, which also silently ignores non-base vector classes. Ideally, the methods would behave the same way in this situation.

Reproducible example

# Non-base classes are inconsistently preserved and ignored across methods
orange_units <- Orange
orange_units$age <- units::set_units(orange_units$age, "m")
head(orange_units)
#>   Tree      age circumference
#> 1    1  118 [m]            30
#> 2    1  484 [m]            58
#> 3    1  664 [m]            87
#> 4    1 1004 [m]           115
#> 5    1 1231 [m]           120
#> 6    1 1372 [m]           142

# XY method preserves non-standard class:
xy_interface <- hardhat::mold(orange_units["age"], orange_units["Tree"])
class(xy_interface$predictors$age)
#> [1] "units"

# Formula method converts to bare numeric:
formula_interface <- hardhat::mold(Tree ~ age, orange_units)
class(formula_interface$predictors$age)
#> [1] "numeric"

# This also happens with vctrs classes:
orange_example <- Orange
orange_example$age <- vctrs::new_vctr(orange_example$age, class = "example")
head(orange_example)
#>   Tree  age circumference
#> 1    1  118            30
#> 2    1  484            58
#> 3    1  664            87
#> 4    1 1004           115
#> 5    1 1231           120
#> 6    1 1372           142

# XY method preserves non-standard class:
xy_interface <- hardhat::mold(orange_example["age"], orange_example["Tree"])
class(xy_interface$predictors$age)
#> [1] "example"    "vctrs_vctr"

# Formula method converts to bare numeric:
formula_interface <- hardhat::mold(Tree ~ age, orange_example)
class(formula_interface$predictors$age)
#> [1] "numeric"

Created on 2022-12-20 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 Patched (2022-11-10 r83330)
#>  os       Ubuntu 22.04.1 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2022-12-20
#>  pandoc   2.19.2 @ /usr/lib/rstudio/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date (UTC) lib source
#>  cli           3.5.0      2022-12-20 [1] CRAN (R 4.2.2)
#>  digest        0.6.31     2022-12-11 [1] CRAN (R 4.2.2)
#>  evaluate      0.19       2022-12-13 [1] CRAN (R 4.2.2)
#>  fansi         1.0.3      2022-03-24 [1] CRAN (R 4.2.2)
#>  fastmap       1.1.0      2021-01-25 [1] CRAN (R 4.2.2)
#>  fs            1.5.2      2021-12-08 [1] CRAN (R 4.2.2)
#>  glue          1.6.2      2022-02-24 [1] CRAN (R 4.2.2)
#>  hardhat       1.2.0.9000 2022-12-20 [1] Github (tidymodels/hardhat@c2c896c)
#>  highr         0.9        2021-04-16 [1] CRAN (R 4.2.2)
#>  htmltools     0.5.4      2022-12-07 [1] CRAN (R 4.2.2)
#>  knitr         1.41       2022-11-18 [1] CRAN (R 4.2.2)
#>  lifecycle     1.0.3      2022-10-07 [1] CRAN (R 4.2.2)
#>  magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.2.2)
#>  pillar        1.8.1      2022-08-19 [1] CRAN (R 4.2.2)
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.2.2)
#>  purrr         0.3.5      2022-10-06 [1] CRAN (R 4.2.2)
#>  R.cache       0.16.0     2022-07-21 [1] CRAN (R 4.2.2)
#>  R.methodsS3   1.8.2      2022-06-13 [1] CRAN (R 4.2.2)
#>  R.oo          1.25.0     2022-06-12 [1] CRAN (R 4.2.2)
#>  R.utils       2.12.2     2022-11-11 [1] CRAN (R 4.2.2)
#>  Rcpp          1.0.9      2022-07-08 [1] CRAN (R 4.2.2)
#>  reprex        2.0.2      2022-08-17 [1] CRAN (R 4.2.2)
#>  rlang         1.0.6      2022-09-24 [1] CRAN (R 4.2.2)
#>  rmarkdown     2.19       2022-12-15 [1] CRAN (R 4.2.2)
#>  rstudioapi    0.14       2022-08-22 [1] CRAN (R 4.2.2)
#>  sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.2.2)
#>  stringi       1.7.8      2022-07-11 [1] CRAN (R 4.2.2)
#>  stringr       1.5.0      2022-12-02 [1] CRAN (R 4.2.2)
#>  styler        1.8.1      2022-11-07 [1] CRAN (R 4.2.2)
#>  tibble        3.1.8      2022-07-22 [1] CRAN (R 4.2.2)
#>  units         0.8-1      2022-12-10 [1] CRAN (R 4.2.2)
#>  utf8          1.2.2      2021-07-24 [1] CRAN (R 4.2.2)
#>  vctrs         0.5.1      2022-11-16 [1] CRAN (R 4.2.2)
#>  withr         2.5.0      2022-03-03 [1] CRAN (R 4.2.2)
#>  xfun          0.35       2022-11-16 [1] CRAN (R 4.2.2)
#>  yaml          2.3.6      2022-10-18 [1] CRAN (R 4.2.2)
#>
#>  [1] /home/mikemahoney218/R/x86_64-pc-linux-gnu-library/4.2
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/lib/R/site-library
#>  [4] /usr/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

DavisVaughan commented 1 year ago

For better or for worse, the formula method uses `model.matrix()` with very few changes, for maximum compatibility with base R. Otherwise we would end up rewriting `model.matrix()` entirely, and that is hard enough that we decided it wasn't worth it. So instead we (knowingly) inherit the quirks of `model.matrix()`. I do at least mention that `model.matrix()` is run here: https://hardhat.tidymodels.org/reference/default_formula_blueprint.html#mold
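
The quirk can be seen without hardhat at all. The following is a minimal base R sketch, not hardhat's internal code: the data frame `df` and its columns are made up for illustration, and a `Date` column stands in for a classed numeric vector such as `units`.

``` r
# Hypothetical toy data: `when` is a double vector carrying the "Date" class.
df <- data.frame(
  y = c(30, 58, 87),
  when = as.Date(c("1969-04-28", "1970-04-29", "1970-10-26"))
)

# Build the model frame, then expand it into a design matrix.
mf <- stats::model.frame(y ~ when, data = df)
mm <- stats::model.matrix(y ~ when, data = mf)

class(df$when)
#> [1] "Date"

# A matrix holds a single atomic type, so the column comes back as plain doubles.
class(mm[, "when"])
#> [1] "numeric"
```

Because the design matrix is a bare numeric matrix, any class attribute on the input column has nowhere to live once `model.matrix()` has run.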

We generally recommend that you use a recipe if you have nonstandard types.
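
A rough sketch of what the recipe route could look like for the `orange_units` data from the reprex. It assumes the recipes and units packages are installed; `rec` and `recipe_interface` are illustrative names, and no output is shown because the resulting class was not verified in this thread.

``` r
library(hardhat)
library(recipes)

# Same setup as the reprex above.
orange_units <- Orange
orange_units$age <- units::set_units(orange_units$age, "m")

# A recipe with no steps: roles are assigned from the formula,
# but the predictors are not passed through model.matrix().
rec <- recipe(Tree ~ age, data = orange_units)
recipe_interface <- mold(rec, orange_units)

# Inspect the class of the molded predictor.
class(recipe_interface$predictors$age)
```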
