tidymodels / hardhat

Construct Modeling Packages
https://hardhat.tidymodels.org
Other
101 stars 16 forks source link

`mold.formula()` creates two predictor columns per logical feature #178

Closed cregouby closed 2 years ago

cregouby commented 2 years ago

The problem

I'm having trouble with mold.formula() when one predictor is logical. The mold result$predictors then includes two dbl columns per logical feature what is not expected.

Neither mold.recipe() nor mold.data.frame() on the same data suffer this issue.

Reproducible example

Unexpected two dbl columns

library(hardhat)
# Data with logical feature --------------------------------------
data <- data.frame(
  feat_logical = sample(c(TRUE, FALSE), 10, replace = TRUE),
  target = factor(sample(c("yes", "no"), 10, replace = TRUE))
)

mold_formula <-  mold(target~. , data=data)
# expecting a single predictor column here
mold_formula$predictors
#> # A tibble: 10 × 2
#>    feat_logicalFALSE feat_logicalTRUE
#>                <dbl>            <dbl>
#>  1                 1                0
#>  2                 0                1
#>  3                 1                0
#>  4                 1                0
#>  5                 1                0
#>  6                 1                0
#>  7                 1                0
#>  8                 1                0
#>  9                 1                0
#> 10                 1                0

Created on 2021-12-15 by the reprex package (v2.0.1)

Unexpected difference between fomula and reciepe

library(hardhat)
# Data with logical feature --------------------------------------
data <- data.frame(
  feat_logical = sample(c(TRUE, FALSE), 10, replace = TRUE),
  target = factor(sample(c("yes", "no"), 10, replace = TRUE))
)

# expecting predictors to be equals between formula and void recipe
mold_formula <-  mold(target~. , data=data)
mold_recipe <- mold(recipes::recipe(target~., data), data=data)

waldo::compare(mold_formula$predictors, mold_recipe$predictors)
#> `old` is length 2
#> `new` is length 1
#> 
#> `names(old)`: "feat_logicalFALSE" "feat_logicalTRUE"
#> `names(new)`: "feat_logical"                        
#> 
#> `old$feat_logicalFALSE` is a double vector (1, 1, 1, 1, 1, ...)
#> `new$feat_logicalFALSE` is absent
#> 
#> `old$feat_logicalTRUE` is a double vector (0, 0, 0, 0, 0, ...)
#> `new$feat_logicalTRUE` is absent
#> 
#> `old$feat_logical` is absent
#> `new$feat_logical` is a logical vector (FALSE, FALSE, FALSE, FALSE, FALSE, ...)

Created on 2021-12-15 by the reprex package (v2.0.1)

Workaround

Either

DavisVaughan commented 2 years ago

This comes from model.matrix() which I guess treats logical columns like factors? So dummy variables are created? And when there is no intercept factors are always fully expanded (this is expected). The whole point of the formula interface is to use model.matrix() with very little modifications, so on first thought I don't think we are going to want to make any changes here.

data <- data.frame(
  feat_logical = sample(c(TRUE, FALSE), 10, replace = TRUE),
  target = factor(sample(c("yes", "no"), 10, replace = TRUE))
)

model.matrix(
  target ~ feat_logical + 0, 
  data
)
#>    feat_logicalFALSE feat_logicalTRUE
#> 1                  0                1
#> 2                  0                1
#> 3                  1                0
#> 4                  0                1
#> 5                  1                0
#> 6                  1                0
#> 7                  0                1
#> 8                  0                1
#> 9                  0                1
#> 10                 1                0
#> attr(,"assign")
#> [1] 1 1
#> attr(,"contrasts")
#> attr(,"contrasts")$feat_logical
#> [1] "contr.treatment"

# we have no intercept by default
hardhat::mold(
  target ~ feat_logical, 
  data
)$predictors
#> # A tibble: 10 × 2
#>    feat_logicalFALSE feat_logicalTRUE
#>                <dbl>            <dbl>
#>  1                 0                1
#>  2                 0                1
#>  3                 1                0
#>  4                 0                1
#>  5                 1                0
#>  6                 1                0
#>  7                 0                1
#>  8                 0                1
#>  9                 0                1
#> 10                 1                0

model.matrix(
  target ~ feat_logical, 
  data
)
#>    (Intercept) feat_logicalTRUE
#> 1            1                1
#> 2            1                1
#> 3            1                0
#> 4            1                1
#> 5            1                0
#> 6            1                0
#> 7            1                1
#> 8            1                1
#> 9            1                1
#> 10           1                0
#> attr(,"assign")
#> [1] 0 1
#> attr(,"contrasts")
#> attr(,"contrasts")$feat_logical
#> [1] "contr.treatment"

hardhat::mold(
  target ~ feat_logical, 
  data, 
  blueprint = hardhat::default_formula_blueprint(intercept = TRUE)
)$predictors
#> # A tibble: 10 × 2
#>    `(Intercept)` feat_logicalTRUE
#>            <dbl>            <dbl>
#>  1             1                1
#>  2             1                1
#>  3             1                0
#>  4             1                1
#>  5             1                0
#>  6             1                0
#>  7             1                1
#>  8             1                1
#>  9             1                1
#> 10             1                0

Created on 2021-12-16 by the reprex package (v2.0.1)

DavisVaughan commented 2 years ago

I'm pretty sure this is expected since model.matrix() has some code that checks is.factor() || is.logical() for determining if something is factorish https://github.com/wch/r-source/blob/79298c499218846d14500255efd622b5021c10ec/src/library/stats/R/models.R#L642

cregouby commented 2 years ago

Hello David, Thanks for this very clear and documented investigation. I'll be able to leave with it !

github-actions[bot] commented 2 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.