Closed juliasilge closed 1 year ago
I think this is likely a workflows problem because the problem also goes away when fitting the parsnip model `tree_spec` and predicting on that.
I think this is probably a hardhat bug, and I think there might actually be two bugs, one when `indicators = "none"` (this case) and one when `indicators = "traditional"`.
When `indicators = "none"`, `hardhat:::forge_formula_default_process_predictors()` should be removing the factor-ish columns from the data before calling `model_matrix()` on it. It doesn't look like it is doing that correctly right now.
When `indicators = "traditional"`, at mold time it seems like we might need to capture information about character columns that were converted to factors and pass that on to forge.
`indicators = "none"` reprex
For this one we might need to make `forge_formula_default_process_predictors()` remove the factor columns before `model_frame/matrix()`.
library(hardhat)
library(tibble)
df <- tibble(
  y = c(1, 3, 2),
  treatment = c("x", "y", "y"),
  other = c(2, 2, 5)
)
mold <- mold(
  y ~ treatment + other,
  df,
  blueprint = default_formula_blueprint(indicators = "none")
)
# original ptype is character
mold$blueprint$ptypes$predictors
#> # A tibble: 0 × 2
#> # … with 2 variables: treatment <chr>, other <dbl>
# output type is character
mold$predictors[0,]
#> # A tibble: 0 × 2
#> # … with 2 variables: other <dbl>, treatment <chr>
# we get the error here, but with `indicators = "none"` I
# don't think we should have been passing any of the factor
# or character columns through to model.matrix()
one_row <- df[2,]
forge(one_row, mold$blueprint)
#> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels
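The idea described for the `indicators = "none"` case, keeping factor-ish columns away from the model matrix machinery entirely, can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: `split_factorish()` is a hypothetical helper, not a hardhat function, and the sketch assumes the predictors are plain column names with no inline transformations.

```r
# Hypothetical helper (not part of hardhat): separate character/factor
# columns from the rest so that model.matrix() never sees them.
split_factorish <- function(data) {
  is_factorish <- vapply(
    data,
    function(col) is.character(col) || is.factor(col),
    logical(1)
  )
  list(
    numeric = data[, !is_factorish, drop = FALSE],
    factorish = data[, is_factorish, drop = FALSE]
  )
}

df <- data.frame(
  y = c(1, 3, 2),
  treatment = c("x", "y", "y"),
  other = c(2, 2, 5),
  stringsAsFactors = FALSE
)

# A single row of new data, as in the reprex
one_row <- df[2, c("treatment", "other")]
parts <- split_factorish(one_row)

# Build a formula from the remaining columns only, with no intercept
numeric_formula <- reformulate(names(parts$numeric), intercept = FALSE)

# Only the non-factor predictors go through model.frame()/model.matrix()
frame <- model.frame(numeric_formula, parts$numeric)
numeric_part <- model.matrix(attr(frame, "terms"), frame)

# The factor-ish columns are passed through untouched and bound back on
result <- cbind(as.data.frame(numeric_part), parts$factorish)
```

Because `treatment` is never handed to `model.matrix()`, the single-level contrasts check is never reached for it.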
`indicators = "traditional"` reprex
For this one (only when `indicators != "none"`) we might need to convert character columns to factors ahead of time in `mold()` (before the model matrix call), then retain the ptypes of the columns we did this to in the blueprint result, then on the `forge()` side we take that ptype and convert the `new_data` columns to match it. This might happen in the clean steps of both mold and forge.
library(hardhat)
library(tibble)
df <- tibble(
  y = c(1, 3, 2),
  treatment = c("x", "y", "y"),
  other = c(2, 2, 5)
)
mold <- mold(
  y ~ treatment + other,
  df,
  blueprint = default_formula_blueprint(indicators = "traditional")
)
# original ptype is character
mold$blueprint$ptypes$predictors
#> # A tibble: 0 × 2
#> # … with 2 variables: treatment <chr>, other <dbl>
# output type has the expanded factor columns
mold$predictors[0,]
#> # A tibble: 0 × 3
#> # … with 3 variables: treatmentx <dbl>, treatmenty <dbl>, other <dbl>
# we get the error here because `model.frame()` doesn't
# know to convert the character column to a factor column.
one_row <- df[2,]
forge(one_row, mold$blueprint)
#> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels
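The mold/forge approach outlined for the `indicators = "traditional"` case could look something like the base R sketch below. The two helpers, `chr_to_fct_train()` and `chr_to_fct_new()`, are hypothetical names used for illustration and are not hardhat functions; the sketch stores factor levels in a plain list rather than in a blueprint ptype.

```r
# Hypothetical "mold" side helper: convert character columns to factors
# and remember the levels seen at training time.
chr_to_fct_train <- function(data) {
  is_chr <- vapply(data, is.character, logical(1))
  data[is_chr] <- lapply(data[is_chr], factor)
  list(data = data, levels = lapply(data[is_chr], levels))
}

# Hypothetical "forge" side helper: rebuild those columns in new data as
# factors that carry the full set of training levels.
chr_to_fct_new <- function(new_data, lvls) {
  for (nm in names(lvls)) {
    new_data[[nm]] <- factor(new_data[[nm]], levels = lvls[[nm]])
  }
  new_data
}

df <- data.frame(
  y = c(1, 3, 2),
  treatment = c("x", "y", "y"),
  other = c(2, 2, 5),
  stringsAsFactors = FALSE
)

# Training time: record the levels of `treatment`
trained <- chr_to_fct_train(df[, c("treatment", "other")])

# Prediction time: one row, but `treatment` still knows about both levels
one_row <- chr_to_fct_new(df[2, c("treatment", "other")], trained$levels)

# The contrasts step now sees a two-level factor, so expansion succeeds
forged <- model.matrix(~ treatment + other + 0, one_row)
```

With the training levels restored, the one-row matrix has the same `treatmentx`, `treatmenty`, and `other` columns as the molded predictors.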
In this `indicators = "none"` case it actually looks like we were trying to do a smart thing. We added a `- treatment` onto the end of the formula to remove it from `model.frame()` and `model.matrix()`. But I guess `model.matrix()` still does some computations on `treatment` (like looking at contrasts) even though the formula removes it, which screws things up here:
df <- data.frame(
  y = c(1, 3, 2),
  treatment = c("x", "y", "y"),
  other = c(2, 2, 5),
  stringsAsFactors = FALSE
)
# this formula is really trying to say that `treatment` should
# be ignored here, but it does some processing with it anyways
frame <- model.frame(~treatment + other + 0 - treatment, df)
terms <- attr(frame, "terms")
# see, treatment was ignored
model.matrix(terms, df)
#>   other
#> 1     2
#> 2     2
#> 3     5
#> attr(,"assign")
#> [1] 1
#> attr(,"contrasts")
#> attr(,"contrasts")$treatment
#> [1] "contr.treatment"
# but it seems like it was trying to use it anyways
model.matrix(terms, df[2,])
#> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels
Ideally I'd call this a base R bug, but I think we are going to need to come up with an alternative strategy here vs tacking `- treatment` onto the end of the formula.
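One possible alternative, sketched here in base R as an illustration rather than as the fix hardhat adopted, is to rebuild the formula from the surviving term labels instead of subtracting the term. Subtracting `treatment` removes it from the term labels, but the variable still ends up as a column of the model frame, which is why `model.matrix()` still touches it. Rebuilding the formula means the variable never enters the model frame at all.

```r
df <- data.frame(
  y = c(1, 3, 2),
  treatment = c("x", "y", "y"),
  other = c(2, 2, 5),
  stringsAsFactors = FALSE
)

# The "subtract the term" approach from above
frame <- model.frame(~ treatment + other + 0 - treatment, df)
terms <- attr(frame, "terms")

# `treatment` is gone from the term labels...
attr(terms, "term.labels")

# ...but it is still a column of the model frame, so model.matrix()
# converts it to a factor and tries to assign contrasts to it
names(frame)

# Alternative: build a fresh formula from the remaining term labels only
clean_formula <- reformulate(attr(terms, "term.labels"), intercept = FALSE)

# Now a one-row data frame works, because `treatment` is never evaluated
clean_frame <- model.frame(clean_formula, df[2, ])
fixed <- model.matrix(attr(clean_frame, "terms"), clean_frame)
```

This keeps the intent of the original `- treatment` trick while avoiding the contrasts check on a variable that is not meant to be expanded.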
I'm not totally sure if this is a problem for workflows or hardhat (or something else?) but there is a problem for predicting with new data that has one row (pretty common situation for deployment) and a variable that is character.
Created on 2022-08-24 with reprex v2.0.2
This will work if:
- `treatment` (doesn't have to be a factor)
- `treatment` variable was converted to factor before training