mlr-org / mlrCPO

Composable Preprocessing Operators for MLR

Set fix.factors = TRUE for cpoCollapseFact #52

Open TuSKan opened 5 years ago

TuSKan commented 5 years ago

Any suggestions?

library(mlr)
library(mlrCPO)

#download.file("https://www.openml.org/data/download/31/dataset_31_credit-g.arff","dataset_31_credit-g.arff")
data <- farff::readARFF("dataset_31_credit-g.arff")
task <- makeClassifTask(data = data, target = "class", positive = "good")
task %<>>% cpoCollapseFact(0.1)
ret <- retrafo(task)
learner <- makeLearner("classif.ranger", predict.type = "prob", par.vals = list(num.trees = 100, mtry = 2))
model <- train(learner,task)
newdata <- data[, names(data) != "class"] # new data to predict on; it has no target column, so I can't make a Task
pred <- predict(model, newdata = newdata %>>% ret) #https://mlr-org.github.io/mlr/articles/tutorial/predict.html

> Error in `levels<-.factor`(`*tmp*`, value = c("male div/sep", "female div/dep/mar",  : 
>   number of levels differs

mb706 commented 5 years ago

The problem here is that the personal_status column of the data appears to contain an empty factor level ("female single"), and makeClassifTask() removes that empty level, so the task data and the original data no longer have the same levels:

> levels(getTaskData(task)[["personal_status"]])
[1] "male div/sep"       "female div/dep/mar" "male single"
[4] "male mar/wid"
> levels(data[["personal_status"]])
[1] "male div/sep"       "female div/dep/mar" "male single"
[4] "male mar/wid"       "female single"

A dropped empty level should not crash cpoCollapseFact; this is a bug. Until it is fixed, the workaround is to use cpoFixFactors:

task %<>>% cpoFixFactors() %>>% cpoCollapseFact(0.1)
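
The level mismatch described above can be illustrated in base R alone. This is only a minimal sketch on a toy factor (the `status` vector and its values are made up for illustration); it does not reproduce mlrCPO's internal code, it just shows how an unused level disappears and why assigning a level vector of a different length fails:

```r
# Toy stand-in for the personal_status column (hypothetical data):
# "female single" is declared as a level but never occurs in the values.
status <- factor(
  c("male single", "male div/sep", "male single"),
  levels = c("male div/sep", "male single", "female single")
)

# Levels that are declared but unused
unused <- setdiff(levels(status), as.character(unique(status)))
print(unused)

# droplevels() removes unused levels, similar in effect to what
# happens to the data when the task is created
trimmed <- droplevels(status)
print(levels(trimmed))

# Assigning fewer level names than the factor has levels raises an error
err <- tryCatch(
  {
    levels(status) <- levels(trimmed)
    NULL
  },
  error = function(e) conditionMessage(e)
)
print(err)
```

Checking `setdiff(levels(x), unique(x))` on each factor column before building the task is a quick way to see whether cpoFixFactors() is likely to be needed.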