mlr-org / mlr3gallery

Case studies using mlr3
https://mlr3gallery.mlr-org.com

Impute Missing Variables article: `mlr_pipeops_missind` seems to accept only one other imputation operator at a time #77

Closed drag05 closed 3 years ago

drag05 commented 3 years ago

While running the code for the classification example "Impute Missing Variables", I added a third imputation operator, namely mlr_pipeops_imputesample.

My graph operator - before adding the learner - looks like this:

graph = po('copy', length(po_list)) %>>% gunion(po_list) %>>% po('featureunion')

where the pipe operators list is

po_list <- list(
  imp_missing <- po('missind'),
  imp_num <- po('imputehist', param_vals = list(affect_columns = selector_type('numeric'))),
  imp_samp <- po('imputesample', param_vals = list(affect_columns = selector_missing()))
)

and the extracted IDs are:

ids <- map_chr(po_list, `[[`, 'id')

[1] "missind"      "imputehist"   "imputesample"

as suggested in the article.

As a short reminder, the dataset used in this article covers diabetes cases among Pima Indians:

> task$data()
     diabetes age glucose insulin mass pedigree pregnant pressure triceps
  1:      pos  50     148      NA 33.6    0.627        6       72      35
  2:      neg  31      85      NA 26.6    0.351        1       66      29
  3:      pos  32     183      NA 23.3    0.672        8       64      NA
  4:      neg  21      89      94 28.1    0.167        1       66      23
  5:      pos  33     137     168 43.1    2.288        0       40      35

The data has missing values in some of the variables:

> task$missings()

diabetes      age  glucose  insulin     mass pedigree pregnant pressure  triceps 
       0        0        5      374       11        0        0       35      227 

When training the graph with all three operators, I get the following error:

> graph$train(task)

 Error in task_data(self, rows, cols, data_format, ordered) : 
  Assertion on 'cols' failed: Must be a subset of {'diabetes','missing_glucose','missing_insulin','missing_mass','missing_pressure','missing_triceps'}, but is {'age'}. 

while the graph plot shows all three operators

[image: plot of the graph with all three operators]

Selecting only one operator at a time to pair with missind (for example, imputehist) imputes the missing data just as in the article:

ids <- map_chr(po_list, `[[`, 'id')

[1] "missind"      "imputehist"

> graph$train(task)[[1]]$data()

   diabetes missing_glucose missing_insulin missing_mass missing_pressure
  1:      pos         present         missing      present          present
  2:      neg         present         missing      present          present
  3:      pos         present         missing      present          present
  4:      neg         present         present      present          present
  5:      pos         present         present      present          present
    missing_triceps age pedigree pregnant glucose   insulin mass pressure  triceps
  1:         present  50    0.627        6     148 118.83233 33.6       72 35.00000
  2:         present  31    0.351        1      85 212.17043 26.6       66 29.00000
  3:         missing  32    0.672        8     183  29.31409 23.3       64 10.59325
  4:         present  21    0.167        1      89  94.00000 28.1       66 23.00000
  5:         present  33    2.288        0     137 168.00000 43.1       40 35.00000

and the graph is plotted correctly

[image: plot of the graph with missind and imputehist]

Pairing missind with imputesample also works:

> ids
[1] "missind"      "imputesample"

> graph$train(task)[[1]]$data()

     diabetes missing_glucose missing_insulin missing_mass missing_pressure
  1:      pos         present         missing      present          present
  2:      neg         present         missing      present          present
  3:      pos         present         missing      present          present
  4:      neg         present         present      present          present
  5:      pos         present         present      present          present
       missing_triceps age pedigree pregnant glucose insulin mass pressure triceps
  1:         present  50    0.627        6     148     116 33.6       72      35
  2:         present  31    0.351        1      85     168 26.6       66      29
  3:         missing  32    0.672        8     183     108 23.3       64       7
  4:         present  21    0.167        1      89      94 28.1       66      23
  5:         present  33    2.288        0     137     168 43.1       40      35

as well as the plot:

[image: plot of the graph with missind and imputesample]

Removing missind from the list throws the same error as before.

While using missind, I tried different selectors for imputesample such as selector_type, selector_name (naming only variables with missing data), selector_grep (idem), to no avail. The only selector that worked was selector_missing().

What is needed to make more than one imputation operator work alongside missind? Please advise!

sumny commented 3 years ago

@drag05 Thanks for the detailed issue and code example! I just stumbled across the issue, sorry for keeping you waiting. In the following I assume that you want to impute the same features with imp_num and imp_samp, and that you want both newly imputed feature sets available along with all other non-missing features and the missing indicators. Please correct me if I misunderstood you.

First, there seems to actually be a bug in mlr3pipelines. I fixed it locally and will push a PR soon and link it here. With this bugfix PipeOpFeatureUnion would actually tell you what the problem with your graph is:

Error: PipeOpFeatureUnion cannot aggregate different features sharing the same feature name.
This applies to the following features: 'glucose', 'insulin', 'mass', 'pressure', 'triceps'

That is, both imputers impute 'glucose', 'insulin', 'mass', 'pressure', and 'triceps' and return them under the same names, although the imputed values differ. PipeOpFeatureUnion does not allow this. In this case, you can pass a character vector to the innum argument during construction; each entry is then used as a prefix for the feature names coming from the corresponding input:

graph = po("copy", length(po_list)) %>>% gunion(po_list) %>>%
  po("featureunion", innum = c("missind", "hist", "sample"))
graph$train(task)[[1]]$feature_names
 [1] "missind.missing_glucose"  "missind.missing_insulin" 
 [3] "missind.missing_mass"     "missind.missing_pressure"
 [5] "missind.missing_triceps"  "hist.age"                
 [7] "hist.pedigree"            "hist.pregnant"           
 [9] "hist.glucose"             "hist.insulin"            
[11] "hist.mass"                "hist.pressure"           
[13] "hist.triceps"             "sample.age"              
[15] "sample.pedigree"          "sample.pregnant"         
[17] "sample.glucose"           "sample.insulin"          
[19] "sample.mass"              "sample.pressure"         
[21] "sample.triceps"
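
To make the name clash concrete, here is a small base-R analogy. It is only a sketch of the idea, not how PipeOpFeatureUnion is implemented: the toy data frames and the add_prefix helper are hypothetical and exist only for this illustration.

```r
# Two "imputed" versions of the same column (toy values, not from the Pima data).
# They agree where the value was observed and differ where it was missing.
hist_imputed   <- data.frame(glucose = c(148, 120, 183))
sample_imputed <- data.frame(glucose = c(148, 148, 183))

# Same feature name, different contents: a plain union would be ambiguous.
clashing    <- intersect(names(hist_imputed), names(sample_imputed))
same_values <- identical(hist_imputed$glucose, sample_imputed$glucose)

# Hypothetical helper: prefix every column name with the name of its source.
add_prefix <- function(df, prefix) {
  names(df) <- paste(prefix, names(df), sep = ".")
  df
}

# After prefixing, both versions can sit side by side.
combined <- cbind(
  add_prefix(hist_imputed, "hist"),
  add_prefix(sample_imputed, "sample")
)
names(combined)
```

The prefixes play the same role as the entries of innum above: they keep the two imputed versions of a feature apart once they are combined.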

Note that the result still is not what you wanted because the imputers also each return the unaltered features, e.g., 'age' but prefixed with either "hist" or "sample". I am not aware of a straightforward way to achieve what you desire. One option could be the following:

graph = 
  gunion(
    list(
      po("missind"),
      po("select", "select_non_miss", param_vals = list(selector = selector_invert(selector_missing()))),
      po("select", "select_miss", param_vals = list(selector = selector_missing())) %>>%
        gunion(list(po("imputehist"), po("imputesample")))
    )
  ) %>>% po("featureunion", innum = c("", "", "hist", "sample"))

graph$plot()

[image: output of graph$plot()]

Training on the task would give you:

graph$train(task)[[1]]$data()
     diabetes missing_glucose missing_insulin missing_mass missing_pressure
  1:      pos         present         missing      present          present
  2:      neg         present         missing      present          present
  3:      pos         present         missing      present          present
  4:      neg         present         present      present          present
  5:      pos         present         present      present          present
 ---                                                                       
764:      neg         present         present      present          present
765:      neg         present         missing      present          present
766:      neg         present         present      present          present
767:      pos         present         missing      present          present
768:      neg         present         missing      present          present
     missing_triceps age pedigree pregnant hist.glucose hist.insulin hist.mass
  1:         present  50    0.627        6          148     35.28455      33.6
  2:         present  31    0.351        1           85    143.44104      26.6
  3:         missing  32    0.672        8          183     86.54584      23.3
  4:         present  21    0.167        1           89     94.00000      28.1
  5:         present  33    2.288        0          137    168.00000      43.1
 ---                                                                          
764:         present  63    0.171       10          101    180.00000      32.9
765:         present  27    0.340        2          122    505.18299      36.8
766:         present  30    0.245        5          121    112.00000      26.2
767:         missing  47    0.349        1          126     12.23604      30.1
768:         present  23    0.315        1           93    225.06743      30.4
     hist.pressure hist.triceps sample.glucose sample.insulin sample.mass
  1:            72     35.00000            148            105        33.6
  2:            66     29.00000             85            387        26.6
  3:            64     40.43254            183             57        23.3
  4:            66     23.00000             89             94        28.1
  5:            40     35.00000            137            168        43.1
 ---                                                                     
764:            76     48.00000            101            180        32.9
765:            70     27.00000            122            176        36.8
766:            72     23.00000            121            112        26.2
767:            60     21.05179            126            215        30.1
768:            70     31.00000             93             55        30.4
     sample.pressure sample.triceps
  1:              72             35
  2:              66             29
  3:              64             27
  4:              66             23
  5:              40             35
 ---                               
764:              76             48
765:              70             27
766:              72             23
767:              60             26
768:              70             31

This should be just what you wanted to achieve. I am not sure how common it is to have different imputed versions of features in an ML workflow (because the different versions of each imputed feature would be highly correlated), but we may discuss this and come up with an easier solution. Please let me know if this was helpful.

drag05 commented 3 years ago

@sumny

This is a very nice solution indeed!

Just to give some perspective, I am contemplating using various imputation methods (operators) as a tuning parameter along with the regular search space, in a framework like this (using the pipeline infix figuratively):

select tuning instance that minimizes the performance measure:

repeat, eventually

To avoid correlations between results from various imputations, as you have mentioned, each set of imputed features should be considered only once in relation to the rest of the parameters. Of course, it will be a bit more complex if various data types are involved.

It would be nice if all this could be packed in one "tuning-with-imputations" po to keep tuning simple.

Thank you!

sumny commented 3 years ago

Thanks for the feedback, I'm glad I could help!

I am not sure whether I fully understood your whole design, but you may already be able to simplify a lot by using branching, e.g., via ppl("branch", ...) (see ?pipeline_branch). For example, consider the following GraphLearner:

graphlearner = GraphLearner$new(
  gunion(
    list(
      po("missind"),
      ppl("branch", graphs = list("hist" = po("imputehist"), "sample" = po("imputesample")))
    )
  ) %>>%
  po("featureunion") %>>%
  lrn("classif.rpart")
)
graphlearner$graph$plot()

[image: output of graphlearner$graph$plot()]

This graphlearner has a branch.selection hyperparameter that specifies whether the imputation is done using PipeOpImputeHist or PipeOpImputeSample:

graphlearner$param_set$params$branch.selection
                 id    class lower upper      levels     default
1: branch.selection ParamFct    NA    NA hist,sample <NoDefault>
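
As a minimal sketch of what this hyperparameter does, the branch can also be fixed by hand before any tuning. The example rebuilds the same GraphLearner so that it runs on its own; it assumes that the task is the Pima task shipped with mlr3 as tsk("pima") and that the packages behave as in the versions used in this thread.

```r
library(mlr3)
library(mlr3pipelines)

# Assumption: the task from the article is the built-in Pima task.
task = tsk("pima")

# Same structure as above: missing indicators plus one of two imputers.
graphlearner = GraphLearner$new(
  gunion(
    list(
      po("missind"),
      ppl("branch", graphs = list("hist" = po("imputehist"), "sample" = po("imputesample")))
    )
  ) %>>%
  po("featureunion") %>>%
  lrn("classif.rpart")
)

# Route the data through the "sample" branch only, then train.
graphlearner$param_set$values$branch.selection = "sample"
graphlearner$train(task)
```

Only the selected branch is executed, so the learner sees exactly one imputed version of each feature.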

This parameter can be tuned like any other hyperparameter, e.g.:

library(paradox)
library(mlr3)
library(mlr3tuning)

search_space = ParamSet$new(list(
  ParamFct$new("branch.selection", levels = c("hist", "sample")),
  ParamInt$new("classif.rpart.maxdepth", lower = 1L, upper = 30L)))

tuner_grid = tnr("grid_search")

instance_grid = TuningInstanceSingleCrit$new(
  task = task,
  learner = graphlearner,
  resampling = rsmp("cv", folds = 3L),
  measure = msr("classif.ce"),
  search_space = search_space,
  terminator = trm("none")
)

tuner_grid$optimize(instance_grid)
   branch.selection classif.rpart.maxdepth learner_param_vals x_domain
1:             hist                     20             <list>   <list>
   classif.ce
1:  0.2447917

sumny commented 3 years ago

@drag05 can I close this issue for now?

drag05 commented 3 years ago

@sumny

Apologies for the lack of reply; I have been tangled up in some semantic.shiny issues (to which I am also new!).

It should be closed.

I had tried something similar to your last solution before I started writing on GitHub, but obviously did something wrong then. I will follow your solution and keep you posted if necessary.

One final, mostly rhetorical, question:

  1. Would featureunion at the end of graphlearner trigger the collinearity issues mentioned earlier?

Thank you for your time!