Closed statist-bhfz closed 4 years ago
Hey,
If I understand your question correctly, your goal is to add a .MISSING dummy indicator during test when a level is not available during training.
This is not sensible, as this column would be constant during training and would thus contain virtually no information.
It is also unclear to me how any algorithm could do something meaningful with such new information during prediction.
Feel free to provide references or hints towards situations where this is dealt with differently; I am happy to learn.
po("fixfactors") thus simply recodes the new factor level to NA, and several imputation strategies can be employed to impute a different level.
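To illustrate the recoding described above, here is a minimal sketch that applies po("fixfactors") on its own. The toy data (dt_train, dt_test) and the variable names are made up for this illustration, and the code assumes a reasonably recent mlr3 that provides as_task_classif().

```r
library(data.table)
library(mlr3)
library(mlr3pipelines)

# Hypothetical toy data: level "3" appears only at prediction time.
dt_train <- data.table(
  fct = factor(c(1, 1, 1, 2, 2, 2)),
  target_result = factor(rep(1:2, 3))
)
dt_test <- data.table(
  fct = factor(c(2, 3)),
  target_result = factor(1:2)
)

task_train <- as_task_classif(dt_train, target = "target_result")
task_test <- as_task_classif(dt_test, target = "target_result")

# fixfactors stores the factor levels seen during training ...
pof <- po("fixfactors")
pof$train(list(task_train))

# ... and during prediction sets every other level to NA.
fixed <- pof$predict(list(task_test))[[1]]$data()
print(fixed)
print(levels(fixed$fct))
```

After this step the unseen level is an ordinary missing value, so any imputation PipeOp placed behind po("fixfactors") can treat it like other NAs.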
My question was more about unexpected behaviour than about practical usage. I slightly changed my example by adding NAs to the training set:
dt_train <- data.table(fct = factor(c(1, 1, NA, 2, 2, 2)),
target_result = factor(1:2))
and got the desired output (new factor levels are mapped to the same fct..MISSING level as NAs):
   target_result fct.1 fct.2 fct..MISSING
1:             1     0     1            0
2:             2     0     0            1
So everything works fine: the additional column is produced only when it makes sense.
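For readers who want to try this, the following is a minimal sketch of one pipeline that could produce a table of this shape. The thread does not show the pipeline that was actually used, so the choice of po("imputeoor") and po("encode") after po("fixfactors") is an assumption, and dt_test is invented for illustration.

```r
library(data.table)
library(mlr3)
library(mlr3pipelines)

# Training data as in the example above: 'fct' contains one NA.
dt_train <- data.table(
  fct = factor(c(1, 1, NA, 2, 2, 2)),
  target_result = factor(rep(1:2, 3))
)
# Assumed prediction data: level "3" was never seen during training.
dt_test <- data.table(
  fct = factor(c(2, 3)),
  target_result = factor(1:2)
)

task_train <- as_task_classif(dt_train, target = "target_result")
task_test <- as_task_classif(dt_test, target = "target_result")

# 1. fixfactors: unseen levels become NA at prediction time.
# 2. imputeoor:  NAs in factors are moved to a separate ".MISSING" level.
# 3. encode:     one-hot encoding gives one column per level.
graph <- po("fixfactors") %>>%
  po("imputeoor") %>>%
  po("encode", method = "one-hot")

graph$train(task_train)
out <- graph$predict(task_test)[[1]]$data()
print(out)
```

Because the training data contains an NA, the ".MISSING" level exists after training, and the unseen level ends up in the same indicator column at prediction time.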
My question is somewhat related to https://github.com/mlr-org/mlr3pipelines/issues/71. I'm not able to implement a very simple approach to dealing, at the prediction stage, with factor levels unseen during training:
Desired output:
NA can be replaced with 0 using mlr_pipeops_imputeconstant() (and that is sufficient in most cases), but how do I add a column for new factor level(s)? It looks like an issue with the interaction between po("fixfactors") and po("missind") applied sequentially.