Closed drag05 closed 3 years ago
@drag05
Thanks for the detailed issue and code example! I just stumbled across the issue - sorry for keeping you waiting. In the following I assume that you want to impute the same features with imp_num and imp_sam and you want to have both newly imputed feature sets available along with all other non-missing features and the missing indicators. Please correct me if I misunderstood you.
First, there seems to actually be a bug in mlr3pipelines. I fixed it locally and will push a PR soon and link it here. With this bugfix, PipeOpFeatureUnion would actually tell you what the problem with your graph is:
Error: PipeOpFeatureUnion cannot aggregate different features sharing the same feature name.
This applies to the following features: 'glucose', 'insulin', 'mass', 'pressure', 'triceps'
I.e., both imputers do their imputation for 'glucose', 'insulin', 'mass', 'pressure', and 'triceps' and return them with the same name, although the imputed values differ. PipeOpFeatureUnion does not allow this. In this case, you can pass a character vector to the innum argument during construction; each element is then used as a prefix for the feature names coming from the corresponding input, i.e.:
graph = po("copy", length(po_list)) %>>% gunion(po_list) %>>%
  po("featureunion", innum = c("missind", "hist", "sample"))
graph$train(task)[[1]]$feature_names
[1] "missind.missing_glucose" "missind.missing_insulin"
[3] "missind.missing_mass" "missind.missing_pressure"
[5] "missind.missing_triceps" "hist.age"
[7] "hist.pedigree" "hist.pregnant"
[9] "hist.glucose" "hist.insulin"
[11] "hist.mass" "hist.pressure"
[13] "hist.triceps" "sample.age"
[15] "sample.pedigree" "sample.pregnant"
[17] "sample.glucose" "sample.insulin"
[19] "sample.mass" "sample.pressure"
[21] "sample.triceps"
Note that the result still is not what you wanted, because each imputer also returns the unaltered features, e.g., 'age', but prefixed with either "hist" or "sample". I am not aware of a straightforward way to achieve what you desire. One option could be the following:
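1. Create the missing indicators with PipeOpMissInd and split the task with respect to missing and non-missing features using Selectors (selector_missing() and its inverse).
2. Impute the features with missing values using both PipeOpImputeHist and PipeOpImputeSample.
3. Combine everything with PipeOpFeatureUnion and prefix the two imputer outputs correctly:
graph =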
  gunion(
    list(
      po("missind"),
      po("select", "select_non_miss", param_vals = list(selector = selector_invert(selector_missing()))),
      po("select", "select_miss", param_vals = list(selector = selector_missing())) %>>%
        gunion(list(po("imputehist"), po("imputesample")))
    )
  ) %>>% po("featureunion", innum = c("", "", "hist", "sample"))
graph$plot()
Training on the task would give you:
graph$train(task)[[1]]$data()
diabetes missing_glucose missing_insulin missing_mass missing_pressure
1: pos present missing present present
2: neg present missing present present
3: pos present missing present present
4: neg present present present present
5: pos present present present present
---
764: neg present present present present
765: neg present missing present present
766: neg present present present present
767: pos present missing present present
768: neg present missing present present
missing_triceps age pedigree pregnant hist.glucose hist.insulin hist.mass
1: present 50 0.627 6 148 35.28455 33.6
2: present 31 0.351 1 85 143.44104 26.6
3: missing 32 0.672 8 183 86.54584 23.3
4: present 21 0.167 1 89 94.00000 28.1
5: present 33 2.288 0 137 168.00000 43.1
---
764: present 63 0.171 10 101 180.00000 32.9
765: present 27 0.340 2 122 505.18299 36.8
766: present 30 0.245 5 121 112.00000 26.2
767: missing 47 0.349 1 126 12.23604 30.1
768: present 23 0.315 1 93 225.06743 30.4
hist.pressure hist.triceps sample.glucose sample.insulin sample.mass
1: 72 35.00000 148 105 33.6
2: 66 29.00000 85 387 26.6
3: 64 40.43254 183 57 23.3
4: 66 23.00000 89 94 28.1
5: 40 35.00000 137 168 43.1
---
764: 76 48.00000 101 180 32.9
765: 70 27.00000 122 176 36.8
766: 72 23.00000 121 112 26.2
767: 60 21.05179 126 215 30.1
768: 70 31.00000 93 55 30.4
sample.pressure sample.triceps
1: 72 35
2: 66 29
3: 64 27
4: 66 23
5: 40 35
---
764: 76 48
765: 70 27
766: 72 23
767: 60 26
768: 70 31
This should be just what you wanted to achieve. I am not sure how common it is to have different imputed versions of the same features in an ML workflow (because the different imputed versions of a feature would be highly correlated with each other), but we may discuss this and come up with an easier solution. Please let me know if this was helpful.
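To get a feeling for how strongly the two imputed versions of each feature correlate, here is a minimal sketch. It assumes the task is the built-in Pima task (tsk("pima")) and rebuilds the graph from above; the variable names imputed and cors are only illustrative. Because imputehist and imputesample both draw random values, the exact correlations will differ between runs.

```r
library(mlr3)
library(mlr3pipelines)

set.seed(1)  # arbitrary seed; both imputers draw random values
task = tsk("pima")  # assumption: the built-in Pima Indians diabetes task

# Same structure as the graph above: missing indicators, the complete
# features untouched, and two differently imputed copies of the rest.
graph = gunion(list(
  po("missind"),
  po("select", id = "select_non_miss",
    param_vals = list(selector = selector_invert(selector_missing()))),
  po("select", id = "select_miss",
    param_vals = list(selector = selector_missing())) %>>%
    gunion(list(po("imputehist"), po("imputesample")))
)) %>>% po("featureunion", innum = c("", "", "hist", "sample"))

dt = graph$train(task)[[1]]$data()

# Pearson correlation between the "hist." and "sample." copy of each feature
imputed = c("glucose", "insulin", "mass", "pressure", "triceps")
cors = sapply(imputed, function(f) {
  cor(dt[[paste0("hist.", f)]], dt[[paste0("sample.", f)]])
})
print(round(cors, 3))
```

Features with few missing values should give correlations close to 1, since both copies share all observed values and differ only in the imputed rows.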
@sumny
This is a very nice solution indeed!
Just to give some perspective, I am contemplating using various imputation methods (operators) as a tuning parameter along with the regular search space in a framework like this (using the pipeline infix figuratively):
select tuning instance that minimizes the performance measure:
repeat, eventually
To avoid correlations between results from various imputations - as you have mentioned - each set of imputed features should be considered only once in relation to the rest of the parameters. Of course, it will be a bit more complex if various data types are involved.
It would be nice if all this could be packed in one "tuning-with-imputations" po to keep tuning simple.
Thank you!
Thanks for the feedback, I'm glad I could help!
I am not sure whether I fully understood your whole design, but you may already be able to simplify a lot by using branching, e.g., via ppl("branch", ...) (see ?pipeline_branch). For example, consider the following GraphLearner:
graphlearner = GraphLearner$new(
  gunion(
    list(
      po("missind"),
      ppl("branch", graphs = list("hist" = po("imputehist"), "sample" = po("imputesample")))
    )
  ) %>>%
    po("featureunion") %>>%
    lrn("classif.rpart")
)
graphlearner$graph$plot()
This graphlearner allows you to set a branch.selection hyperparameter that specifies whether the imputation is done using either PipeOpImputeHist or PipeOpImputeSample:
graphlearner$param_set$params$branch.selection
id class lower upper levels default
1: branch.selection ParamFct NA NA hist,sample <NoDefault>
This parameter can be tuned like any other hyperparameter, e.g.:
library(paradox)
library(mlr3)
library(mlr3tuning)
search_space = ParamSet$new(list(
  ParamFct$new("branch.selection", levels = c("hist", "sample")),
  ParamInt$new("classif.rpart.maxdepth", lower = 1L, upper = 30L)))
tuner_grid = tnr("grid_search")
instance_grid = TuningInstanceSingleCrit$new(
  task = task,
  learner = graphlearner,
  resampling = rsmp("cv", folds = 3L),
  measure = msr("classif.ce"),
  search_space = search_space,
  terminator = trm("none")
)
tuner_grid$optimize(instance_grid)
branch.selection classif.rpart.maxdepth learner_param_vals x_domain
1: hist 20 <list> <list>
classif.ce
1: 0.2447917
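For readers on more recent package versions, the following is a minimal sketch of the same tuning setup. It assumes that the ps()/p_fct()/p_int() shorthand from paradox and the ti() helper from mlr3tuning are available in the installed versions (the code above uses the older R6 constructors instead), and that the task is tsk("pima"). The grid resolution of 5 and the seed are arbitrary choices for illustration. It also shows one way to compare the two imputation branches from the tuning archive and to train the final learner with the selected values.

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3tuning)
library(paradox)
library(data.table)

set.seed(1)  # arbitrary; CV splits and imputations are random
task = tsk("pima")  # assumption: the built-in Pima Indians diabetes task

graphlearner = GraphLearner$new(
  gunion(list(
    po("missind"),
    ppl("branch", graphs = list(hist = po("imputehist"), sample = po("imputesample")))
  )) %>>%
    po("featureunion") %>>%
    lrn("classif.rpart")
)

# Shorthand search space: imputation branch plus one rpart hyperparameter
search_space = ps(
  branch.selection = p_fct(levels = c("hist", "sample")),
  classif.rpart.maxdepth = p_int(lower = 1L, upper = 30L)
)

instance = ti(
  task = task,
  learner = graphlearner,
  resampling = rsmp("cv", folds = 3L),
  measures = msr("classif.ce"),
  search_space = search_space,
  terminator = trm("none")
)
tnr("grid_search", resolution = 5)$optimize(instance)

# Compare the two imputation branches over all evaluated configurations
archive = as.data.table(instance$archive)
print(archive[, .(mean_ce = mean(classif.ce), best_ce = min(classif.ce)),
  by = branch.selection])

# Train the final learner with the selected configuration
graphlearner$param_set$values = instance$result_learner_param_vals
graphlearner$train(task)
```

Each imputation method is only ever active on its own branch, so the learner never sees both imputed versions of a feature at the same time.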
@drag05 can I close this issue for now?
@sumny
Apologies for the lack of reply; I have been tangled up in some semantic.shiny issues (to which I am also new!).
It should be closed.
I had tried something similar to your last solution before I started writing on GitHub, but obviously did something wrong then. I will follow your solution and keep you posted if necessary.
One final, mostly rhetorical, question: would featureunion at the end of graphlearner trigger the collinearity issues mentioned earlier?
Thank you for your time!
While running the code for the classification example "Impute Missing Variables", I added a third imputation operator, namely mlr_pipeops_imputesample.
My graph operator - before adding the learner - looks like this:
where the pipe operators list is
and the extracted indices are:
as suggested in the article.
As a short reminder, the dataset used in this article refers to diabetes cases of Pima Indians:
The data has missing values in some of the variables
When training the graph with all three operators, I get the following error:
while the graph plot shows all three operators
Selecting only one operator at a time to pair with missind, for example imputehist, imputes the missing data just as in the example:
and the graph is plotted correctly
Pairing missind with imputesample also works:
as well as the plot:
Removing missind from the list throws the same error as before.
While using missind, I tried different selectors for imputesample such as selector_type, selector_name (naming only variables with missing data), selector_grep (idem), to no avail. The only selector that worked was selector_missing().
What is necessary for having more than one imputation operator work simultaneously with missind? Please advise!