tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

Possible to use recipes for novel levels in testing data? #105

Closed BigTimeStats closed 6 years ago

BigTimeStats commented 6 years ago

Hello and thank you for putting together this package.

What’s the best way to use recipes and step_dummy() to handle novel categorical levels in testing/validation data? Since this is an issue I encounter quite frequently, I wonder if recipes could help on this front.

I notice in your documentation that dummy variables are generated with reference to the first level:

By default, the missing dummy variable (i.e. the reference cell) will correspond to the first level of the unordered factor being converted.

Is there a benefit to coding the matrix this way vs. having all levels as columns?

When running the code below, NAs are generated for novel levels in the testing dataset. This seems less than ideal; should the user be warned of this behavior?

If the dummy variables were not generated with reference to the first level, novel levels could simply be encoded as 0's across the board (for all of that column's indicator variables), so that predictions could still be made on the dataset (along with a possible warning). I can see why this might be unwise, and currently I run some custom functions to remap novel levels, but from an automated model-scoring perspective it would be helpful to at least have the option. Let me know your thoughts.

library(recipes)  # for recipe(), step_dummy(), prep(), bake(), and %>%

df <- data.frame(y = c(1,0,1,1,0,0,0,1,1,1,0,0,1,0,1,0,0,0,1,0),
                 x1 = c('A','B','B','B','B','A','A','A','B','A','A','B','A','C','C','B','A','B','C','A'),
                 stringsAsFactors = FALSE)

training <- df[1:10,]
testing <- df[11:20,]

rec_obj <- recipe(y ~ ., data = training) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep(training = training)

x_train_tbl <- bake(rec_obj, newdata = training)
x_test_tbl  <- bake(rec_obj, newdata = testing)

x_train_tbl
# A tibble: 10 x 2
#       y  x1_B
#   <dbl> <dbl>
# 1     1     0
# 2     0     1
# 3     1     1
# 4     1     1
# 5     0     1
# 6     0     0
# 7     0     0
# 8     1     0
# 9     1     1
# 10    1     0

x_test_tbl
# A tibble: 10 x 2
#       y  x1_B
#   <dbl> <dbl>
# 1     0     0
# 2     0     1
# 3     1     0
# 4     0    NA
# 5     1    NA
# 6     0     1
# 7     0     0
# 8     0     1
# 9     1    NA
# 10    0     0

# This would make the missing levels coded as "A"
# x_test_tbl[is.na(x_test_tbl)] <- 0

# Code only to produce possible expected output
x_test_tbl2 <- data.frame(y = x_test_tbl$y, x1_A = x_test_tbl$x1_B, x1_B = x_test_tbl$x1_B)
x_test_tbl2$x1_A <- ifelse(x_test_tbl2$x1_A == 0, 1, 0)

# If step_dummy did not code against a reference level, all that would be needed is:
x_test_tbl2[is.na(x_test_tbl2)] <- 0

x_test_tbl2
#    y x1_A x1_B
# 1  0    1    0
# 2  0    0    1
# 3  1    1    0
# 4  0    0    0
# 5  1    0    0
# 6  0    0    1
# 7  0    1    0
# 8  0    0    1
# 9  1    0    0
# 10 0    1    0
topepo commented 6 years ago

Is there a benefit to coding the matrix this way vs. having all levels as columns?

A lot of times, yes. For parametric models, a full set of dummy variables will cause a linear dependency in the data (since the sum of those columns is aliased with the intercept). In general, if you know all but one dummy variable, you can figure out the last one. This is how model.matrix and almost every other piece of modeling software does it (by default, at least).
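
For illustration, a minimal base R sketch of that dependency, using model.matrix with and without the reference cell:

x <- factor(c("A", "B", "C", "A"))

# Full set of indicators (no reference cell): the columns always sum to 1,
# so they are aliased with the intercept and the design matrix loses rank
full <- model.matrix(~ x - 1)
rowSums(full)       # all 1's, identical to the intercept column

# Treatment coding drops the first level to break the dependency
model.matrix(~ x)   # columns: (Intercept), xB, xC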

Not all models do this, and I will (eventually) make an option for the full set of dummies (as caret does), but there are some technical obstacles that make it non-trivial.

What’s the best way to use recipes and step_dummy() to handle novel categorical levels in testing/validation data? Since this is an issue I encounter quite frequently, I wonder if recipes could help on this front.

I'm adjusting step_other to do this (even if you don't want to pool the factor levels). It will be able to assign a value of "other" (or whatever you like) to novel values.

There is a pending PR to add feature hashing. In that case, values are assigned to however many indicator columns you specify when the step is created. The assignment isn't random, but it should uniformly distribute the existing (and new) levels of the factor across the indicators. There can be quite a lot of aliasing between factor levels, but that's just how it works.
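
To make the idea concrete, here is a minimal base R sketch of the hashing trick (not the PR's implementation; the toy hash function and bucket count are only illustrative):

# Each level string is hashed into one of n_buckets indicator columns,
# so novel levels always land in an existing column
hash_level <- function(x, n_buckets) {
  # toy hash: sum of the character codes, modulo the bucket count
  vapply(x, function(s) sum(utf8ToInt(s)) %% n_buckets + 1, numeric(1))
}

hash_dummies <- function(x, n_buckets = 4) {
  idx <- hash_level(as.character(x), n_buckets)
  out <- matrix(0, nrow = length(x), ncol = n_buckets,
                dimnames = list(NULL, paste0("hash_", seq_len(n_buckets))))
  out[cbind(seq_along(x), idx)] <- 1
  out
}

hash_dummies(c("A", "B", "C"))   # a novel "C" needs no special handling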

When running the code below, using recipes, NA’s are generated for novel levels in the testing dataset. This seems less than ideal, should the user be warned of this behavior?

Agreed. I had this discussion with someone else and I'll make a change to throw a warning. There is a branch of the package that allows "checks" to be put in place for simple data validation. If you would rather get an error, the check_missing step in that PR would do it for you if you place it after step_dummy.

Interestingly, model.matrix is silent on the matter:

> iris_1 <- iris[1:90,]
> iris_1$Species <- factor(as.character(iris_1$Species))
> 
> iris_2 <- iris[91:150,]
> iris_2$Species <- factor(as.character(iris_2$Species))
> 
> term_obj <- terms(~ ., data = iris_1)
> 
> iris_1$Species[c(1, 51)]
[1] setosa     versicolor
Levels: setosa versicolor
> model.matrix(term_obj, data = iris_1[c(1, 51),])
   (Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width Speciesversicolor
1            1          5.1         3.5          1.4         0.2                 0
51           1          7.0         3.2          4.7         1.4                 1
attr(,"assign")
[1] 0 1 2 3 4 5
attr(,"contrasts")
attr(,"contrasts")$Species
[1] "contr.treatment"

> 
> iris_2$Species[c(1, 10)]
[1] versicolor versicolor
Levels: versicolor virginica
> model.matrix(term_obj, data = iris_2[c(1, 10),])
    (Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width Speciesvirginica
91            1          5.5         2.6          4.4         1.2                0
100           1          5.7         2.8          4.1         1.3                0
attr(,"assign")
[1] 0 1 2 3 4 5
attr(,"contrasts")
attr(,"contrasts")$Species
[1] "contr.treatment"
topepo commented 6 years ago

Also, WinVector/vtreat has a bunch of methods for dealing with novel levels.

BigTimeStats commented 6 years ago

Thank you for the response.

Understood on the linear dependency; I know SAS does it with respect to the base level as well.

Not all models do this and I will (eventually) make an option for the full set of dummies (as caret does) but there are some technical obstacles there that make it non-trivial.

It's helpful for interpretability when calling varImp() to have the full set of dummies, instead of having to look up the level they are coded against. I assume this is partly the logic behind caret.

I'm adjusting step_other to do this (even if you don't want to pool the factor levels). It will be able to assign a value of "other" (or whatever you like) to novel values.

If it does assign a value of 'other', how would it work if there is no 'other' value in the training dataframe?

Interestingly, model.matrix is silent on the matter

Yep, I also tried that, but without creating the terms object, just using the formula for both training and testing:

df <- data.frame(y = c(1,0,1,1,0,0,0,1,1,1,0,0,1,0,1,0,0,0,1,0),
                 x1 = c('A','B','B','B','B','A','A','A','B','A','A','B','A','C','C','B','A','B','C','A'),
                 stringsAsFactors = FALSE)
training <- df[1:10,]
testing <- df[11:20,]
training$y <- as.factor(training$y)
training$x1 <- as.factor(training$x1)
testing$y <- as.factor(testing$y)
testing$x1 <- as.factor(testing$x1)
model.matrix(y ~ x1, training, contrasts.arg = lapply(training, contrasts, contrasts=FALSE))
model.matrix(y ~ x1, testing, contrasts.arg = lapply(testing, contrasts, contrasts=FALSE))

While researching this, I noticed that caret::dummyVars has the functionality I am basically looking for, but only if the data is submitted as character rather than factor:

No error when submitted as character:

training$x1 <- as.character(training$x1)
testing$x1 <- as.character(testing$x1)
dummy_obj <- dummyVars(~ x1, training, sep = '_') # The sep argument does not seem to work when the data is character
predict(dummy_obj, training) 
predict(dummy_obj, testing) # no error

When factor, returns an error:

training$x1 <- as.factor(training$x1)
testing$x1 <- as.factor(testing$x1)
dummy_obj <- dummyVars(~ x1, training, sep = '_') 
predict(dummy_obj, training) # sep argument works when data is factor
predict(dummy_obj, testing) # error
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$lvls) : 
# factor x1 has new levels C

I could imagine a function that would use caret::dummyVars and add the functionality that, if a level is present in the training data but not in the testing data, a new all-0 column is added to the testing data.

Something like this:

columns_to_append_to_testing <- colnames(training)[!(colnames(training) %in% colnames(testing))]
matrix_to_append <- matrix(0, nrow = nrow(testing), ncol = length(columns_to_append_to_testing))
colnames(matrix_to_append) <- columns_to_append_to_testing
testing <- cbind(testing, matrix_to_append)  # append the all-0 columns

Also have looked into vtreat but have not experimented with it.

I'd love to contribute more, but after looking at some of the code in dummy.R, I'm not sure I have the knowledge to fit the above into step_novel, etc. Thanks!

topepo commented 6 years ago

Check that code out and see if it does what you are looking for.

BigTimeStats commented 6 years ago

Thanks for adding that. The code works, but functions differently than I would expect: it adds a constant column named x1_new to the modeling dataset, which throws errors when running certain models. My suggestion would be to allow the option to create dummies without reference to the original level, so that the testing dataset can contain the modeling dataset's levels (as columns) coded as 0. It looks like glm, xgbTree (and probably others) run with just a warning:

Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut = 10, : These variables have zero variances: x1_new

But this generates an error:

df <- data.frame(y = c(1,0,1,1,0,0,0,1,1,1,0,0,1,0,1,0,0,0,1,0),
                 x1 = c('A','B','B','B','B','A','A','A','B','A','A','B','A','C','C','B','A','B','C','A'),
                 stringsAsFactors = FALSE)
df$y <- ifelse(df$y == 1, 'Y','N') %>% as.factor()
training <- df[1:10,]
testing <- df[11:20,]

rec_obj <- recipe(y ~ ., data = training) %>%
    step_novel(all_nominal(), -all_outcomes()) %>%
    step_dummy(all_nominal(), -all_outcomes()) %>% 
    prep(training = training)

x_train_tbl <- bake(rec_obj, newdata = training)
x_test_tbl  <- bake(rec_obj, newdata = testing)

glimpse(x_train_tbl)
# Observations: 10
# Variables: 3
# $ y      <fctr> Y, N, Y, Y, N, N, N, Y, Y, Y
# $ x1_B   <dbl> 0, 1, 1, 1, 1, 0, 0, 0, 1, 0
# $ x1_new <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

model2 <- caret::train(y ~ .,
                         x_train_tbl,
                         method = 'lda',
                         trControl = caret::trainControl(method = 'none',
                                                         verboseIter = TRUE,
                                                         classProbs = TRUE),
                         metric = 'Accuracy',
                         preProcess = c('center','scale'))

# Fitting parameter = none on full training set
# Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut = 10,  :
#   These variables have zero variances: x1_new

# Error in lda.default(x, grouping, ...) : 
#   variable 2 appears to be constant within groups 
topepo commented 6 years ago

You can add the preprocessing method "zv" to remove zero-variance columns before they get to the model function.
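
For example, the train() call above with "zv" added first:

model2 <- caret::train(y ~ .,
                       x_train_tbl,
                       method = 'lda',
                       trControl = caret::trainControl(method = 'none',
                                                       verboseIter = TRUE,
                                                       classProbs = TRUE),
                       metric = 'Accuracy',
                       # "zv" filters zero-variance columns (here, x1_new)
                       # before the centering/scaling and the model fit
                       preProcess = c('zv', 'center', 'scale'))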

My suggestion would be to allow the option to create dummies without reference to the original level, so that the testing dataset can contain the modeling dataset levels (as columns), but coded as 0.

I don't see how that would help you if the next data set has a new factor level.

caret has a different contrast function called contr.ltfr that should do this:

> options(contrasts = c(unordered = "contr.ltfr", ordered = "contr.poly"))
> colnames(model.matrix(~ ., data = iris))
[1] "(Intercept)"       "Sepal.Length"      "Sepal.Width"       "Petal.Length"     
[5] "Petal.Width"       "Speciessetosa"     "Speciesversicolor" "Speciesvirginica" 

I haven't integrated different contrasts into step_dummy yet.

BigTimeStats commented 6 years ago

I don't see how that would help you if the next data set has a new factor level.

If the column created with step_novel is removed from the modeling dataset, my understanding is that the model would not be able to score the new dataset. Similarly, when you run factors through a caret model and a new level appears in testing, the predict function will fail. Technically, however, a new column in the testing dataset will still allow predict to generate predictions. In this sense, the new columns created by step_dummy() are just extra columns, and the prediction won't (or shouldn't necessarily) fail. However, because the levels within the column are coded relative to the base level, their interpretation will change and the model scoring will also be wrong.

If the model is trained on a column with levels A & B and this is the training frame

# x1
#  A
#  A
#  B

# Even if step_novel is added, it will be removed in training due to zero variance columns
# x1_B x1_new
#    0      0
#    0      0
#    1      0

# Then, when scoring on testing, as such
# x1
#  A
#  B
#  C

# Running recipes on it:
# x1_B x1_new
#    0      0
#    1      0
#    0      1   # This means, because x1_new is not in the model, that this
#                 row is interpreted as "A", since x1_B is 0.

# That's why a non-parametric model could still perform `predict` on this dataset if it includes the base
# level. Running recipes on it:
# x1_A x1_B x1_new
#    1    0      0
#    0    1      0
#    0    0      1    # Since x1_new is not in the model, this row is de facto
#                       "other", since both x1_A and x1_B are 0.

Thank you.

topepo commented 6 years ago

I see your point about the column getting removed (or ignored). Once I looked at it, I saw that lm has drop.unused.levels hardcoded in.

I don't agree with this though:

However, because the levels within the column are in reference to the base level, their interpretation will change and the model scoring will also be wrong.

It wouldn't matter how many levels the variable had (under the standard "treatment" encoding); all zeros would indicate the reference level. Once the model is trained with the columns that it sees, it wouldn't even use x1_new.[1]

I'll get the feature hashing approach into recipes and also work on the "less than full rank contrasts" (which needs a better name).

[1] I think the lesson here is to use models that don't need dummy variables, such as trees, rules, naive Bayes, and a few others. The "new" slot would always be in the model from the start.

topepo commented 6 years ago

So step_other can now effectively assign new levels to "other".

Also, there is the embed package, which provides other steps for dealing with novel levels.
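
For instance, a sketch of embed's effect (likelihood) encoding, which replaces the factor with a per-level numeric estimate and reserves an estimate for levels not seen in training (assuming step_lencode_glm and the factor-outcome data from the step_novel example above):

library(embed)
library(dplyr)  # for vars()

rec_obj <- recipe(y ~ ., data = training) %>%
  step_lencode_glm(x1, outcome = vars(y)) %>%
  prep(training = training)

# x1 becomes a single numeric column; the novel "C" in testing should
# receive the reserved new-level estimate rather than NA
bake(rec_obj, newdata = testing)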

pablo14 commented 5 years ago

Hi @topepo, thanks for this package (too). I'm preparing a post about it.

I want to ask whether step_other could allow 0 in the threshold parameter.

The case is: in a predictive model, it's not useful to collapse the low-share categories into a group; some values might carry valuable information with respect to the target variable.

However, I still want to have the other category during training time. I do this by hand in production.

The "hack" I thought is assigning threshold = 0.000000000000001. Any more elegant ideas?

Thanks.

topepo commented 5 years ago

Once, in the last few months, I thought about doing that and forgot all about it. Are you my conscience?

It'll be in the next release. I'll loop back around to recipes in the next 2-3 weeks.

Can you start a new issue though?

pablo14 commented 5 years ago

Are you my conscience?

Haha! Well, I've found myself "inventing" things that, after some time, turned out to have already been invented by others... maybe after a long time around a topic, common sense suggests the solution 😄

Can you start a new issue though?

Done: https://github.com/tidymodels/recipes/issues/334 (I'll delay the blog post until there's a solution).

Thanks!

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.