tidymodels / parsnip

A tidy unified interface to models
https://parsnip.tidymodels.org

linear_reg() can't handle unseen levels #1084

Open EmilHvitfeldt opened 3 months ago

EmilHvitfeldt commented 3 months ago

Originally posted in https://stackoverflow.com/questions/78169514/r-tidymodels-step-novel-does-not-work-when-combined-in-a-workflow-with-resampl.

Basically, lm() only keeps track of the factor levels that actually appear in the training data, regardless of which levels the factor declares.

So when step_novel() adds the level "new", lm() drops it because no training row uses it. When new data then arrives with that level, predict() complains.

As far as I can tell, this is hardcoded behavior of lm(): https://github.com/SurajGupta/r-source/blob/a28e609e72ed7c47f6ddfbb86c85279a0750f0b7/src/library/stats/R/lm.R#L32
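To illustrate the point without parsnip, here is a minimal base R sketch. It relies on lm() building its model frame with drop.unused.levels = TRUE, which is what the linked line appears to do. The toy data frame and its column names are made up for illustration.

```r
# Toy data: the factor declares three levels, but "c" never occurs in any row.
train <- data.frame(
  y = c(1, 2, 3, 4),
  x = factor(c("a", "b", "a", "b"), levels = c("a", "b", "c"))
)

fit <- lm(y ~ x, data = train)

# The factor itself still carries all three declared levels ...
levels(train$x)

# ... but the fitted model only records the levels that were observed.
fit$xlevels

# Predicting on the declared-but-unobserved level fails in model.frame().
new_data <- data.frame(x = factor("c", levels = c("a", "b", "c")))
err <- tryCatch(predict(fit, new_data), error = function(e) conditionMessage(e))
err
```

The declared level "c" is silently dropped at fit time, so the failure only shows up at prediction time.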

I don't know how much we can do about this, but we might be able to catch this earlier and throw a better error.

Smaller reprex below:

library(parsnip)
library(dplyr)

data("ames", package = "modeldata")

ames_mini <- ames |>
  select(Sale_Price, Lot_Shape)

ames_train <- ames_mini |>
  filter(Lot_Shape != "Regular")

levels(ames_train$Lot_Shape)
#> [1] "Regular"              "Slightly_Irregular"   "Moderately_Irregular"
#> [4] "Irregular"

ames_test <- ames_mini |>
  filter(Lot_Shape == "Regular")

lm_spec <- linear_reg()

lm_fit <- fit(lm_spec, Sale_Price ~ ., ames_train)

lm_fit$fit$xlevels
#> $Lot_Shape
#> [1] "Slightly_Irregular"   "Moderately_Irregular" "Irregular"

lm_fit |>
  predict(ames_test)
#> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor Lot_Shape has new level Regular
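One way the earlier check could look, as a rough sketch only: compare the factor values in the new data against the xlevels stored on the fitted lm object before calling predict(). The helpers find_unseen_levels() and check_unseen_levels() are hypothetical names, not part of parsnip, and the toy data is made up. The sketch assumes factor predictors enter the formula under their own column names.

```r
# Hypothetical helper (not part of parsnip): for each factor predictor, list the
# values in `new_data` that the fitted lm object did not see during training.
find_unseen_levels <- function(lm_object, new_data) {
  xlevels <- lm_object$xlevels
  # Only check predictors that are present in new_data.
  shared <- intersect(names(xlevels), names(new_data))
  unseen <- lapply(shared, function(col) {
    # Missing values are not treated as a new level.
    setdiff(unique(as.character(new_data[[col]])), c(xlevels[[col]], NA))
  })
  names(unseen) <- shared
  # Keep only predictors with at least one unseen value.
  unseen[lengths(unseen) > 0]
}

# Hypothetical wrapper: fail early with a message naming the offending levels.
check_unseen_levels <- function(lm_object, new_data) {
  unseen <- find_unseen_levels(lm_object, new_data)
  if (length(unseen) > 0) {
    details <- vapply(
      names(unseen),
      function(col) paste0(col, ": ", paste(unseen[[col]], collapse = ", ")),
      character(1)
    )
    stop(
      "`new_data` contains factor levels not seen when fitting: ",
      paste(details, collapse = "; "),
      call. = FALSE
    )
  }
  invisible(new_data)
}

# Toy example: level "c" is declared but never observed in training.
train <- data.frame(
  y = c(1, 2, 3, 4),
  x = factor(c("a", "b", "a", "b"), levels = c("a", "b", "c"))
)
fit <- lm(y ~ x, data = train)
new_data <- data.frame(x = factor(c("a", "c"), levels = c("a", "b", "c")))

unseen <- find_unseen_levels(fit, new_data)
msg <- tryCatch(
  check_unseen_levels(fit, new_data),
  error = function(e) conditionMessage(e)
)
```

With the reprex above, the equivalent call would be check_unseen_levels(lm_fit$fit, ames_test). Where such a check should live inside parsnip, and how the error should be worded, is left open here.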