Encoding new levels for factors

statist-bhfz commented 4 years ago

My question is somewhat related to https://github.com/mlr-org/mlr3pipelines/issues/71. I'm not able to implement a very simple approach for handling, at the prediction stage, factor levels that were not seen during training:

library(data.table)
library(mlr3verse)

dt_train <- data.table(fct = factor(c(1, 1, 1, 2, 2, 2)),
                       target_result = factor(1:2))
dt_test <- data.table(fct = factor(c(2, 3)),
                      target_result = factor(1:2))
task <- TaskClassif$new(id = "id", 
                        backend = dt_train, 
                        target = "target_result")
task_test <- TaskClassif$new(id = "id", 
                        backend = dt_test, 
                        target = "target_result")

gr <- po("fixfactors") %>>%
  list(
    po("nop"),
    po("missind")
    ) %>>% 
  po("featureunion") %>>%
  po("imputeoor") %>>%
  po("encode", method = "one-hot")

gr$train(task)

res <- gr$predict(task_test)

res$encode.output$data()

#    target_result fct.1 fct.2
# 1:             1     0     1
# 2:             2    NA    NA

Desired output:

#    target_result fct.1 fct.2 .MISSING
# 1:             1     0     1        0
# 2:             2     0     0        1

NA can be replaced with 0 using mlr_pipeops_imputeconstant() (which is sufficient in most cases), but how can I add a column for the new factor level(s)? It looks like an issue with the interaction between po("fixfactors") and po("missind") when they are applied sequentially.

pfistfl commented 4 years ago

Hey,

If I understand your question correctly, your goal is to add a .MISSING dummy indicator at prediction time when a level was not available during training. This is not sensible, as this column would be constant during training and thus contain virtually no information. It is also unclear to me how any algorithm could do something meaningful with such new information during prediction. Feel free to provide references or hints towards situations where this is handled differently; I'm happy to learn.

po("fixfactors") thus simply recodes the new factor level to NA, and several imputation strategies can then be used to impute a different level.
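
The recoding step can be illustrated without mlr3pipelines. The following base R sketch is not the implementation of po("fixfactors"); it only assumes that the training levels are stored and re-applied at prediction time. The names train_fct, test_fct and test_fixed are made up for this illustration.

```r
# Levels seen during training: "1" and "2"
train_fct <- factor(c(1, 1, 1, 2, 2, 2))
# Prediction data contains the unseen level "3"
test_fct <- factor(c(2, 3))

# Re-apply the training levels: values outside them become NA
test_fixed <- factor(as.character(test_fct), levels = levels(train_fct))

levels(test_fixed)  # only the training levels remain
is.na(test_fixed)   # the observation with the unseen level is now NA
```

After this step the unseen level is indistinguishable from an ordinary missing value, which is why any imputation strategy can take over from here.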

statist-bhfz commented 4 years ago

My question was more about unexpected behaviour than about practical usage. I slightly changed my example by adding an NA to the training set:

dt_train <- data.table(fct = factor(c(1, 1, NA, 2, 2, 2)),
                       target_result = factor(1:2))

and got the desired output (new factor levels are mapped to the same fct..MISSING level as NAs):

   target_result fct.1 fct.2 fct..MISSING
1:             1     0     1            0
2:             2     0     0            1

So everything works fine: the additional column is produced only when it makes sense.
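
Why the extra column appears in this second case can be sketched in base R. This is only an illustration, under the assumption (suggested by the two outputs above) that the .MISSING level is created because the training column contained an NA; it does not reproduce the internals of po("imputeoor") or po("encode"). The helpers add_missing_level and one_hot are hypothetical.

```r
# Hypothetical helper: turn NA into an explicit ".MISSING" level
add_missing_level <- function(x) {
  chr <- as.character(x)
  chr[is.na(chr)] <- ".MISSING"
  factor(chr, levels = c(levels(x), ".MISSING"))
}

# Hypothetical helper: one 0/1 column per factor level
one_hot <- function(x) {
  sapply(levels(x), function(lvl) as.integer(x == lvl))
}

train_fct <- factor(c(1, 1, NA, 2, 2, 2))  # training data with an NA
test_fct  <- factor(c(2, 3))               # "3" was never seen in training

# Unseen level -> NA (training levels re-applied), then NA -> ".MISSING"
test_fixed <- factor(as.character(test_fct), levels = levels(train_fct))
encoded <- one_hot(add_missing_level(test_fixed))
encoded  # columns "1", "2", ".MISSING"; rows (0, 1, 0) and (0, 0, 1)
```

The second row matches the output above: the unseen level "3" ends up in the same indicator column as a missing value.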

mlr-org / mlr3pipelines

Encoding new levels for factors #508