Closed cregouby closed 2 years ago
This comes from model.matrix(), which I guess treats logical columns like factors, so dummy variables are created. And when there is no intercept, factors are always fully expanded (this is expected). The whole point of the formula interface is to use model.matrix() with very few modifications, so on first thought I don't think we are going to want to make any changes here.
data <- data.frame(
feat_logical = sample(c(TRUE, FALSE), 10, replace = TRUE),
target = factor(sample(c("yes", "no"), 10, replace = TRUE))
)
model.matrix(
target ~ feat_logical + 0,
data
)
#> feat_logicalFALSE feat_logicalTRUE
#> 1 0 1
#> 2 0 1
#> 3 1 0
#> 4 0 1
#> 5 1 0
#> 6 1 0
#> 7 0 1
#> 8 0 1
#> 9 0 1
#> 10 1 0
#> attr(,"assign")
#> [1] 1 1
#> attr(,"contrasts")
#> attr(,"contrasts")$feat_logical
#> [1] "contr.treatment"
# we have no intercept by default
hardhat::mold(
target ~ feat_logical,
data
)$predictors
#> # A tibble: 10 × 2
#> feat_logicalFALSE feat_logicalTRUE
#> <dbl> <dbl>
#> 1 0 1
#> 2 0 1
#> 3 1 0
#> 4 0 1
#> 5 1 0
#> 6 1 0
#> 7 0 1
#> 8 0 1
#> 9 0 1
#> 10 1 0
model.matrix(
target ~ feat_logical,
data
)
#> (Intercept) feat_logicalTRUE
#> 1 1 1
#> 2 1 1
#> 3 1 0
#> 4 1 1
#> 5 1 0
#> 6 1 0
#> 7 1 1
#> 8 1 1
#> 9 1 1
#> 10 1 0
#> attr(,"assign")
#> [1] 0 1
#> attr(,"contrasts")
#> attr(,"contrasts")$feat_logical
#> [1] "contr.treatment"
hardhat::mold(
target ~ feat_logical,
data,
blueprint = hardhat::default_formula_blueprint(intercept = TRUE)
)$predictors
#> # A tibble: 10 × 2
#> `(Intercept)` feat_logicalTRUE
#> <dbl> <dbl>
#> 1 1 1
#> 2 1 1
#> 3 1 0
#> 4 1 1
#> 5 1 0
#> 6 1 0
#> 7 1 1
#> 8 1 1
#> 9 1 1
#> 10 1 0
Created on 2021-12-16 by the reprex package (v2.0.1)
I'm pretty sure this is expected, since model.matrix() has some code that checks is.factor() || is.logical() to determine whether something is factor-like:
https://github.com/wch/r-source/blob/79298c499218846d14500255efd622b5021c10ec/src/library/stats/R/models.R#L642
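One consequence of that check, sketched below with base R only: if the logical column is converted to numeric before it reaches the formula interface, it is no longer treated as factor-like and stays a single 0/1 column. The toy data and the names data_num, mm_logical and mm_numeric are illustrative assumptions, not part of the original reprex, and whether this conversion suits a given downstream model is a separate question.

```r
# Deterministic toy data (hypothetical; not the data from the reprex above)
data <- data.frame(
  feat_logical = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  target = factor(c("yes", "no", "no", "yes", "no"))
)

# A logical predictor is treated like a factor: without an intercept it is
# fully expanded into one indicator column per level
mm_logical <- model.matrix(target ~ feat_logical + 0, data)
colnames(mm_logical)
#> [1] "feat_logicalFALSE" "feat_logicalTRUE"

# Converting the logical to numeric first keeps it as a single 0/1 column
data_num <- transform(data, feat_logical = as.numeric(feat_logical))
mm_numeric <- model.matrix(target ~ feat_logical + 0, data_num)
colnames(mm_numeric)
#> [1] "feat_logical"
```

Since the formula interface of mold() relies on model.matrix(), the same conversion should presumably carry over to hardhat::mold(), although that is not verified here.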
Hello David, thanks for this very clear and well-documented investigation. I'll be able to live with it!
The problem
I'm having trouble with mold.formula() when one predictor is logical. The mold result's $predictors then includes two dbl columns per logical feature, which is not expected. Neither mold.recipe() nor mold.data.frame() suffers from this issue on the same data.
Reproducible example
Unexpected two dbl columns
Created on 2021-12-15 by the reprex package (v2.0.1)
Unexpected difference between formula and recipe
Created on 2021-12-15 by the reprex package (v2.0.1)
Workaround
Use either recipe(), as in the second part of the reprex, or mold() (with different results depending on the downstream model).
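
As a rough illustration of the data frame route mentioned in the problem description, the sketch below passes the predictors and the outcome separately to hardhat::mold(). It assumes the hardhat package is installed; the toy data and the name xy_predictors are hypothetical and do not come from the original reprex.

```r
# Hypothetical toy data, not the data from the original reprex
data <- data.frame(
  feat_logical = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  target = factor(c("yes", "no", "no", "yes", "no"))
)

# XY (data frame) interface: predictors are supplied as a data frame and the
# outcome as a separate vector, so model.matrix() is not involved
xy_predictors <- hardhat::mold(
  x = data["feat_logical"],
  y = data$target
)$predictors

# Inspect the number of predictor columns and the column type
ncol(xy_predictors)
class(xy_predictors$feat_logical)
```

Keeping the column logical leaves any encoding decision to the downstream model, which is why the results can differ from the formula interface.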