mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0

Titanic Example Fails the Imputation Pipeline #668

Closed rsangole closed 2 years ago

rsangole commented 2 years ago

Description

Hello,

I'm trying to replicate the Titanic example from this post. However, I get an error saying that the embarked column has missing values, even though I built the pipeline as posted.

Could use some guidance - am I going wrong somewhere? I've put down a reproducible example below.

Cheers!

Reproducible example

library(mlr3verse)
library(mlr3learners)
library(data.table)

set.seed(420)

data("titanic", package = "mlr3data")
setDT(titanic)

task <- as_task_classif(titanic, target = "survived", positive = "yes")
task$set_row_roles(892:1309, "holdout")
task$select(setdiff(task$feature_names, c("cabin", "name", "ticket")))

cv3 <- rsmp("cv", folds = 3L)
cv3$instantiate(task)

learner <- lrn("classif.ranger", num.trees = 250, min.node.size = 4)

poind <- po(
        "missind",
        affect_columns = selector_type(c("numeric", "integer")),
        type = "numeric"
)

gunion(list(poind, po("imputehist"))) %>>%
        po("featureunion") %>>%
        po("imputeoor") %>>%
        po("imputesample") %>>%
        po("fixfactors") %>>%
        po(learner) |>
        as_learner() -> graph_learner

rr <- resample(
        task,
        graph_learner,
        cv3,
        store_models = TRUE
)

Output

Loading required package: mlr3
data.table 1.14.2 using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com
INFO  [05:35:18.820] [mlr3] Applying learner 'missind.imputehist.featureunion.imputeoor.imputesample.fixfactors.classif.ranger' on task 'titanic' (iter 2/3) 
INFO  [05:35:20.347] [mlr3] Applying learner 'missind.imputehist.featureunion.imputeoor.imputesample.fixfactors.classif.ranger' on task 'titanic' (iter 3/3) 
Error: Task 'titanic' has missing values in column(s) 'embarked', but learner 'classif.ranger' does not support this
This happened PipeOp classif.ranger's $predict()

Session Info

r$> sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] data.table_1.14.2  mlr3learners_0.5.3 mlr3verse_0.2.5    mlr3_0.13.3       

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3           paradox_0.9.0          lattice_0.20-45        listenv_0.8.0          palmerpenguins_0.1.0   digest_0.6.29          utf8_1.2.2            
 [8] parallelly_1.32.0      R6_2.5.1               ranger_0.14.1          backports_1.4.1        reprex_2.0.1           ggplot2_3.3.6          pillar_1.7.0          
[15] rlang_1.0.2            mlr3fselect_0.7.1      uuid_1.1-0             rstudioapi_0.13        R.utils_2.11.0         R.oo_1.24.0            Matrix_1.4-1          
[22] checkmate_2.1.0        styler_1.7.0           mlr3pipelines_0.4.1    munsell_0.5.0          compiler_4.2.0         pkgconfig_2.0.3        clipr_0.8.0           
[29] globals_0.15.0         mlr3tuning_0.13.1      tidyselect_1.1.2       tibble_3.1.7           mlr3data_0.6.0         lgr_0.4.3              mlr3cluster_0.1.3     
[36] mlr3misc_0.10.0        mlr3tuningspaces_0.2.0 codetools_0.2-18       clusterCrit_1.2.8      fansi_1.0.3            future_1.26.1          crayon_1.5.1          
[43] dplyr_1.0.9            withr_2.5.0            R.methodsS3_1.8.1      grid_4.2.0             jsonlite_1.8.0         gtable_0.3.0           lifecycle_1.0.1       
[50] magrittr_2.0.3         scales_1.2.0           future.apply_1.9.0     cli_3.3.0              mlr3viz_0.5.9          renv_0.15.3            fs_1.5.2              
[57] mlr3filters_0.5.0      ellipsis_0.3.2         bbotk_0.5.3            generics_0.1.2         vctrs_0.4.1            tools_4.2.0            R.cache_0.15.0        
[64] glue_1.6.2             purrr_0.3.4            parallel_4.2.0         clue_0.3-61            colorspace_2.0-3       cluster_2.1.3         

mb706 commented 2 years ago

Thanks for the report! The problem here is that po("fixfactors") removes factor levels that were not seen during training. Apparently, in this particular CV split, the training set does not contain any missing values in the embarked column. po("imputeoor") does impute the missing values during prediction by introducing the .MISSING level, but po("fixfactors") then drops that level again because it was not present during training, which turns the affected values back into NA.
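The level-dropping step can be mimicked in base R. This is only a minimal sketch, not mlr3pipelines code: the vectors `train_embarked` and `pred_embarked` are made up for illustration, and re-encoding with `factor(..., levels = ...)` is assumed to stand in for what po("fixfactors") does at prediction time.

```r
# Hypothetical training fold: 'embarked' has no missing values,
# so the level ".MISSING" never shows up during training.
train_embarked <- factor(c("C", "Q", "S", "S"))
train_levels <- levels(train_embarked)

# Hypothetical prediction fold after out-of-range imputation:
# the former NA has been replaced by the new level ".MISSING".
pred_embarked <- factor(
  c("S", ".MISSING", "C"),
  levels = c(train_levels, ".MISSING")
)

# Restricting the factor to the training levels (a stand-in for fixfactors)
# turns every value with an unseen level into NA again.
pred_fixed <- factor(as.character(pred_embarked), levels = train_levels)

print(pred_fixed)
print(anyNA(pred_fixed))
```

If the second entry comes back as NA, the learner sees a missing value even though an imputation step ran earlier.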

This is something that should be fixed by po("imputesample"), but in your code, po("imputesample") comes before po("fixfactors"). Instead, it should come afterwards:

gunion(list(poind, po("imputehist"))) %>>%
        po("featureunion") %>>%
        po("imputeoor") %>>%
        po("fixfactors") %>>%     #!!
        po("imputesample") %>>%   #!!
        po(learner) |>
        as_learner() -> graph_learner

This makes this particular example run for me. Does it solve the issue for you?
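
Why the order matters can also be sketched in base R. Again, this is an illustration under assumptions rather than the mlr3pipelines implementation: `impute_by_sampling()` is a hypothetical helper that fills NA entries by drawing from observed training values, which is roughly the idea behind po("imputesample").

```r
# Hypothetical helper: replace NA entries with values sampled from the training column.
impute_by_sampling <- function(x, train_values) {
  missing <- is.na(x)
  pool <- train_values[!is.na(train_values)]
  x[missing] <- sample(pool, sum(missing), replace = TRUE)
  x
}

set.seed(1)
train_embarked <- factor(c("C", "Q", "S", "S"))

# Prediction column after level fixing: the unseen level has already become NA.
pred_fixed <- factor(c("S", NA, "C"), levels = levels(train_embarked))

# Sampling imputation placed after level fixing leaves no NA for the learner.
pred_ready <- impute_by_sampling(pred_fixed, train_embarked)

print(pred_ready)
print(anyNA(pred_ready))
```

Run the other way round, the sampling step would have nothing to fill, and the NA created by level fixing would reach the learner untouched.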

rsangole commented 2 years ago

Yep, that absolutely fixed it! 'Twas an oversight on my end; thanks for the correction and help!