Problem with predicting on new data for character column

juliasilge commented 2 years ago

I'm not totally sure if this is a problem for workflows or hardhat (or something else?), but prediction fails on new data that has one row (a pretty common situation for deployment) and a character variable.

library(tidymodels)
data("tree_frogs", package = "stacks")

tree_frogs <- tree_frogs %>%
  filter(!is.na(latency)) %>%
  select(-c(clutch, hatched))

## notice that `treatment` is a character:
tree_frogs
#> # A tibble: 572 × 5
#>    treatment  reflex    age t_o_d     latency
#>    <chr>      <fct>   <dbl> <chr>       <dbl>
#>  1 control    full   466965 morning        22
#>  2 control    low    361180 night         360
#>  3 control    full   401595 afternoon     106
#>  4 control    mid    357810 night         180
#>  5 control    full   397440 afternoon      60
#>  6 gentamicin full   463230 morning        39
#>  7 control    full   393900 afternoon     214
#>  8 control    full   469065 morning        50
#>  9 control    full   400240 afternoon     224
#> 10 control    full   466160 morning        63
#> # … with 562 more rows

set.seed(123)
frog_split <- initial_split(tree_frogs, prop = 0.8)
frog_train <- training(frog_split)
frog_test <- testing(frog_split)

tree_spec <-
  decision_tree() %>%
  set_mode("regression")

tree_fit <-
  workflow(latency ~ treatment + age, tree_spec) %>% 
  fit(data = frog_train) 

new_frog <- tibble(treatment = "control", age = 4e6)
predict(tree_fit, new_data = new_frog)
#> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels

Created on 2022-08-24 with reprex v2.0.2

This will work if:

- the new data has more than one value in treatment (doesn't have to be a factor)
- the treatment variable was converted to factor before training
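
The two conditions above can be illustrated in base R. This is only a sketch, assuming the failure comes from `model.matrix()` setting contrasts on a factor with a single level, as the error message suggests. The small data frames are made up for illustration and are not the `tree_frogs` data, and the second condition is shown by giving the new data a factor with both levels, which is presumably what happens when the training column was a factor.

```r
# One row of character data: `treatment` has a single distinct value,
# so it becomes a one-level factor and the contrasts step fails.
one_row <- data.frame(treatment = "control", age = 4e6, stringsAsFactors = FALSE)
res_chr <- try(model.matrix(~ treatment + age, one_row), silent = TRUE)
inherits(res_chr, "try-error")

# Condition 1: more than one distinct value in the new data.
two_rows <- data.frame(
  treatment = c("control", "gentamicin"),
  age = c(4e6, 4e6),
  stringsAsFactors = FALSE
)
mm_two <- model.matrix(~ treatment + age, two_rows)

# Condition 2: a factor that carries both levels, even with one row.
one_row_fct <- one_row
one_row_fct$treatment <- factor(
  one_row_fct$treatment,
  levels = c("control", "gentamicin")
)
mm_fct <- model.matrix(~ treatment + age, one_row_fct)
```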

juliasilge commented 2 years ago

I think this is likely a workflows problem because the problem also goes away when fitting the parsnip model tree_spec and predicting on that.

DavisVaughan commented 2 years ago

I think this is probably a hardhat bug, and I think there might actually be two bugs, one when indicators = "none" (this case) and one when indicators = "traditional".

When indicators = "none", hardhat:::forge_formula_default_process_predictors() should be removing the factor-ish columns from the data before calling model_matrix() on it. It doesn't look like it is doing that correctly right now.

When indicators = "traditional", at mold time it seems like we might need to capture information about character columns that were converted to factors and pass that on to forge().

DavisVaughan commented 2 years ago

indicators = "none" reprex

For this one we might need to make forge_formula_default_process_predictors() remove the factor columns before calling model_frame()/model_matrix().

library(hardhat)
library(tibble)

df <- tibble(
  y = c(1, 3, 2), 
  treatment = c("x", "y", "y"),
  other = c(2, 2, 5)
)

mold <- mold(
  y ~ treatment + other, 
  df, 
  blueprint = default_formula_blueprint(indicators = "none")
)

# original ptype is character
mold$blueprint$ptypes$predictors
#> # A tibble: 0 × 2
#> # … with 2 variables: treatment <chr>, other <dbl>

# output type is character
mold$predictors[0,]
#> # A tibble: 0 × 2
#> # … with 2 variables: other <dbl>, treatment <chr>

# we get the error here, but with `indicators = "none"` I
# don't think we should have been passing any of the factor
# or character columns through to model.matrix()
one_row <- df[2,]
forge(one_row, mold$blueprint)
#> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels

indicators = "traditional" reprex

For this one (only when indicators != "none") we might need to convert character columns to factors ahead of time in mold(), before the model matrix call, and retain the ptypes of those converted columns in the blueprint. Then on the forge() side we would take that ptype and convert the new_data columns to match it. This might happen in the clean steps of both mold() and forge().
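
A rough base R sketch of that idea, not hardhat's implementation: record the levels of character columns at training time, then re-apply them to new data before building the model matrix. The helper names `record_levels()` and `apply_levels()` are hypothetical and exist only for this illustration.

```r
df <- data.frame(
  y = c(1, 3, 2),
  treatment = c("x", "y", "y"),
  other = c(2, 2, 5),
  stringsAsFactors = FALSE
)

# "mold" side (hypothetical helper): remember the levels
# of each character column seen at training time
record_levels <- function(data) {
  chr_cols <- names(data)[vapply(data, is.character, logical(1))]
  lapply(data[chr_cols], function(col) sort(unique(col)))
}

# "forge" side (hypothetical helper): convert those columns
# to factors that carry the stored training levels
apply_levels <- function(new_data, lvls) {
  for (nm in names(lvls)) {
    new_data[[nm]] <- factor(new_data[[nm]], levels = lvls[[nm]])
  }
  new_data
}

lvls <- record_levels(df)
one_row <- apply_levels(df[2, ], lvls)

# a single row now works because `treatment` still has two levels
mm <- model.matrix(~ treatment + other, one_row)
```

Storing the levels separately from the data is the point of the sketch: the single new row cannot reveal that a second level exists, so that information has to travel with the blueprint.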

library(hardhat)
library(tibble)

df <- tibble(
  y = c(1, 3, 2), 
  treatment = c("x", "y", "y"),
  other = c(2, 2, 5)
)

mold <- mold(
  y ~ treatment + other, 
  df, 
  blueprint = default_formula_blueprint(indicators = "traditional")
)

# original ptype is character
mold$blueprint$ptypes$predictors
#> # A tibble: 0 × 2
#> # … with 2 variables: treatment <chr>, other <dbl>

# output type has the expanded factor columns
mold$predictors[0,]
#> # A tibble: 0 × 3
#> # … with 3 variables: treatmentx <dbl>, treatmenty <dbl>, other <dbl>

# we get the error here because `model.frame()` doesn't
# know to convert the character column to a factor column.
one_row <- df[2,]
forge(one_row, mold$blueprint)
#> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels

DavisVaughan commented 2 years ago

In this indicators = "none" case it actually looks like we were trying to do a smart thing. We added a - treatment onto the end of the formula to remove it from model.frame() and model.matrix(). But I guess model.matrix() still does some computations on treatment (like looking at contrasts) even though the formula removes it, which screws things up here:

df <- data.frame(
  y = c(1, 3, 2), 
  treatment = c("x", "y", "y"),
  other = c(2, 2, 5),
  stringsAsFactors = FALSE
)

# this formula is really trying to say that `treatment` should
# be ignored here, but it does some processing with it anyways
frame <- model.frame(~treatment + other + 0 - treatment, df)
terms <- attr(frame, "terms")

# see, treatment was ignored
model.matrix(terms, df)
#>   other
#> 1     2
#> 2     2
#> 3     5
#> attr(,"assign")
#> [1] 1
#> attr(,"contrasts")
#> attr(,"contrasts")$treatment
#> [1] "contr.treatment"

# but it seems like it was trying to use it anyways
model.matrix(terms, df[2,])
#> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels

Ideally I'd call this a base R bug but I think we are going to need to come up with an alternative strategy here vs tacking - treatment onto the end of the formula.
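
One possible alternative, sketched in base R under the assumption that the columns to ignore are known by name: drop them from both the formula and the data before calling `model.frame()` and `model.matrix()`, so no contrasts are ever computed for them, then re-attach them untouched. This illustrates the idea only and is not necessarily the approach hardhat ended up taking; the `ignore` variable is made up for the example.

```r
df <- data.frame(
  y = c(1, 3, 2),
  treatment = c("x", "y", "y"),
  other = c(2, 2, 5),
  stringsAsFactors = FALSE
)

# columns that should bypass model.matrix() (assumed known ahead of time)
ignore <- "treatment"

one_row <- df[2, ]

# remove the outcome and the ignored columns from the data itself,
# instead of subtracting them in the formula
kept <- one_row[setdiff(names(one_row), c("y", ignore))]

frame <- model.frame(~ other + 0, kept)
mm <- model.matrix(attr(frame, "terms"), frame)

# re-attach the ignored columns as they were
result <- cbind(as.data.frame(mm), one_row[ignore])
```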

github-actions[bot] commented 1 year ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

tidymodels / hardhat

Problem with predicting on new data for character column #213