tidymodels / hardhat

Construct Modeling Packages
https://hardhat.tidymodels.org

Bug Report: `mold()` incorrectly thinks there is an interaction on the LHS of the formula #173

Closed ddsjoberg closed 2 years ago

ddsjoberg commented 2 years ago

Hello! I am writing a modeling package and using hardhat to guide the structure to make it compatible with common methods (e.g. passing a formula and a data frame) and with recipes.

I ran into an issue with the outcome specification in hardhat::mold(). I've included an example using the {survival} package that illustrates the issue.

When the survival package is loaded and the outcome is specified in the formula with the Surv() function, mold() prepares the data without error. But if the package prefix is used in the outcome specification, i.e. survival::Surv(), mold() thinks there is an interaction term on the LHS of the formula and returns an error (reprex below).

I am not sure if supporting Surv() was an accident or if the error with survival::Surv() is a bug 🤷🏼 Thank you!

library(survival)

# this does NOT work
hardhat::mold(survival::Surv(time, status) ~ age, data = lung)
#> Error: Interaction terms cannot be specified on the LHS of `formula`. The following interaction terms were found: 'survival::Surv(time, status)'.

# this works
hardhat::mold(Surv(time, status) ~ age, data = lung)
#> $predictors
#> # A tibble: 228 x 1
#>      age
#>    <dbl>
#>  1    74
#>  2    68
#>  3    56
#>  4    57
#>  5    60
#>  6    74
#>  7    68
#>  8    71
#>  9    53
#> 10    61
#> # ... with 218 more rows
#> 
#> $outcomes
#> # A tibble: 228 x 1
#>    `Surv(time, status)`[,"time"] [,"status"]
#>                            <dbl>       <dbl>
#>  1                           306           1
#>  2                           455           1
#>  3                          1010           0
#>  4                           210           1
#>  5                           883           1
#>  6                          1022           0
#>  7                           310           1
#>  8                           361           1
#>  9                           218           1
#> 10                           166           1
#> # ... with 218 more rows
#> 
#> $blueprint
#> Formula blueprint: 
#>  
#> # Predictors: 1 
#>   # Outcomes: 2 
#>    Intercept: FALSE 
#> Novel Levels: FALSE 
#>  Composition: tibble 
#>   Indicators: traditional 
#> 
#> $extras
#> $extras$offset
#> NULL

Created on 2021-11-06 by the reprex package (v2.0.1)

Session info

``` r
sessioninfo::session_info()
#> - Session info --------------------------------------------------------------
#>  hash: sparkle, wine glass, pouting cat
#>
#>  setting  value
#>  version  R version 4.1.1 (2021-08-10)
#>  os       Windows 10 x64 (build 18363)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       America/New_York
#>  date     2021-11-06
#>  pandoc   2.14.0.3 @ C:/Program Files/RStudio/bin/pandoc/ (via rmarkdown)
#>
#> - Packages -------------------------------------------------------------------
#>  package     * version date (UTC) lib source
#>  backports     1.3.0   2021-10-27 [1] CRAN (R 4.1.1)
#>  cli           3.1.0   2021-10-27 [1] CRAN (R 4.1.1)
#>  crayon        1.4.2   2021-10-29 [1] CRAN (R 4.1.1)
#>  digest        0.6.28  2021-09-23 [2] CRAN (R 4.1.1)
#>  ellipsis      0.3.2   2021-04-29 [2] CRAN (R 4.1.0)
#>  evaluate      0.14    2019-05-28 [2] CRAN (R 4.1.0)
#>  fansi         0.5.0   2021-05-25 [2] CRAN (R 4.1.0)
#>  fastmap       1.1.0   2021-01-25 [2] CRAN (R 4.1.0)
#>  fs            1.5.0   2020-07-31 [2] CRAN (R 4.1.0)
#>  glue          1.4.2   2020-08-27 [2] CRAN (R 4.1.0)
#>  hardhat       0.1.6   2021-07-14 [2] CRAN (R 4.1.0)
#>  highr         0.9     2021-04-16 [2] CRAN (R 4.1.0)
#>  htmltools     0.5.2   2021-08-25 [2] CRAN (R 4.1.1)
#>  knitr         1.36    2021-09-29 [2] CRAN (R 4.1.1)
#>  lattice       0.20-45 2021-09-22 [2] CRAN (R 4.1.1)
#>  lifecycle     1.0.1   2021-09-24 [2] CRAN (R 4.1.0)
#>  magrittr      2.0.1   2020-11-17 [2] CRAN (R 4.1.0)
#>  Matrix        1.3-4   2021-06-01 [2] CRAN (R 4.1.1)
#>  pillar        1.6.4   2021-10-18 [2] CRAN (R 4.1.1)
#>  pkgconfig     2.0.3   2019-09-22 [2] CRAN (R 4.1.0)
#>  purrr         0.3.4   2020-04-17 [2] CRAN (R 4.1.0)
#>  R.cache       0.15.0  2021-04-30 [2] CRAN (R 4.1.0)
#>  R.methodsS3   1.8.1   2020-08-26 [2] CRAN (R 4.1.0)
#>  R.oo          1.24.0  2020-08-26 [2] CRAN (R 4.1.0)
#>  R.utils       2.11.0  2021-09-26 [2] CRAN (R 4.1.0)
#>  reprex        2.0.1   2021-08-05 [2] CRAN (R 4.1.0)
#>  rlang         0.4.12  2021-10-18 [1] CRAN (R 4.1.1)
#>  rmarkdown     2.11    2021-09-14 [2] CRAN (R 4.1.1)
#>  rstudioapi    0.13    2020-11-12 [2] CRAN (R 4.1.0)
#>  sessioninfo   1.2.1   2021-11-02 [1] CRAN (R 4.1.1)
#>  stringi       1.7.5   2021-10-04 [2] CRAN (R 4.1.1)
#>  stringr       1.4.0   2019-02-10 [2] CRAN (R 4.1.0)
#>  styler        1.6.2   2021-09-23 [2] CRAN (R 4.1.1)
#>  survival    * 3.2-13  2021-08-24 [2] CRAN (R 4.1.1)
#>  tibble        3.1.5   2021-09-30 [2] CRAN (R 4.1.1)
#>  utf8          1.2.2   2021-07-24 [2] CRAN (R 4.1.0)
#>  vctrs         0.3.8   2021-04-29 [2] CRAN (R 4.1.0)
#>  withr         2.4.2   2021-04-18 [2] CRAN (R 4.1.0)
#>  xfun          0.27    2021-10-18 [1] CRAN (R 4.1.1)
#>  yaml          2.2.1   2020-02-01 [2] CRAN (R 4.1.0)
#>
#>  [1] C:/Users/sjobergd/R-dev
#>  [2] C:/Program Files/R/R-4.1.1/library
#>
#> ------------------------------------------------------------------------------
```

juliasilge commented 2 years ago

Looks like this is happening in detect_interactions(), which produces the false positive you are seeing:

https://github.com/tidymodels/hardhat/blob/78367e5e4c8746b3f44c7338140ad2a4313fea6e/R/blueprint-formula-default.R#L889
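
For readers following along, here is a minimal sketch of how a false positive like this can arise. It assumes the check looks for a colon anywhere in a term label; that is a guess about the linked line, not a copy of hardhat's code. The `term_labels` vector is made up for illustration.

```r
# Hypothetical term labels, written the way a formula's terms might be labelled.
term_labels <- c(
  "survival::Surv(time, status)",  # namespaced call, not an interaction
  "Surv(time, status)",            # plain call
  "age:sex"                        # a genuine interaction
)

# Assumed check (not hardhat's actual code): flag any label containing a colon.
has_colon <- grepl(":", term_labels, fixed = TRUE)

# The namespaced call is flagged alongside the real interaction.
term_labels[has_colon]
#> [1] "survival::Surv(time, status)" "age:sex"
```

Under that assumption, the `::` in the package prefix is indistinguishable from the `:` that marks an interaction, which would match the error message in the reprex.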

ddsjoberg commented 2 years ago

Thank you @juliasilge for taking a look!

Perhaps the line could be updated to use a regular expression with lookbehind and lookahead. The regex below matches a colon that is neither preceded nor followed by another colon, so `::` is ignored while a single `:` is still detected.

terms_nms <- c("inter:action", "no::interaction", "varname")

grepl("(?<!:):(?!:)", terms_nms, perl = TRUE)
#> [1]  TRUE FALSE FALSE

Created on 2021-11-17 by the reprex package (v2.0.1)
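
The proposed pattern can be wrapped in a small helper and tried on a few more labels. This is only a sketch: `has_interaction_colon()` and the example labels are hypothetical names for illustration, not part of hardhat.

```r
# Hypothetical helper: TRUE when a label contains a single `:` that is not
# part of a `::` or `:::` namespace operator.
has_interaction_colon <- function(labels) {
  grepl("(?<!:):(?!:)", labels, perl = TRUE)
}

labels <- c(
  "age:sex",                          # plain interaction
  "survival::Surv(time, status)",     # exported namespaced call
  "pkg:::f(x)",                       # internal namespaced call (made-up name)
  "survival::Surv(time, status):age"  # namespaced call inside an interaction
)

has_interaction_colon(labels)
#> [1]  TRUE FALSE FALSE  TRUE
```

One limitation of any purely textual check: a lone colon used for something else inside a term, such as a sequence like `1:3` passed as an argument, would still match this pattern.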

github-actions[bot] commented 2 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.