Closed notiv closed 4 years ago
About what mlrCPO supports and what is possible with it:
What is currently easiest is to re-run the preprocessing step on the original data, treating it as if it were new (prediction) data, using the "retrafo" stored in the model. This is supposed to always give the same result as the original preprocessing:
> lrn = cpoFilterUnivariate(perc = .5) %>>% makeLearner("classif.randomForest")
> model = train(lrn, pid.task)
> retr = retrafo(model) # the trained 'retrafo' object
> orig.indata = pid.task %>>% retr
> head(getTaskData(orig.indata))
pregnant glucose pressure insulin diabetes
1 6 148 72 0 pos
2 1 85 66 0 neg
3 8 183 64 0 pos
4 1 89 66 94 neg
5 0 137 40 168 pos
6 5 116 74 0 neg
>
(Note: Unfortunately this does not work with subsampling CPOs, and it does not work if non-CPO preprocessing is done.)
(Also note: Obviously you cannot just do pid.task %>>% cpoFilterUnivariate(perc = 0.5) and expect the same result, since some CPOs are stochastic.)
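To make the stochasticity point concrete, a small sketch (assuming the model and retr objects from the session above, and using mlr's getTaskFeatureNames to compare which features survive the filter):

```r
# Two fresh applications of the (stochastic) filter CPO may select
# different feature subsets:
run1 = getTaskFeatureNames(pid.task %>>% cpoFilterUnivariate(perc = 0.5))
run2 = getTaskFeatureNames(pid.task %>>% cpoFilterUnivariate(perc = 0.5))
identical(run1, run2)  # may be FALSE

# The trained retrafo replays the *fitted* preprocessing, so repeated
# application is deterministic:
identical(
  getTaskFeatureNames(pid.task %>>% retr),
  getTaskFeatureNames(pid.task %>>% retr))  # should always be TRUE
```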
You can also do newdata %>>% retr to get what the underlying randomForest model (in this case) sees when you do predict(model, newdata). This works even when subsampling is done, since subsampling only happens during training.
This only works with retrafo if the training models are kept when resampling, i.e. models = TRUE, and even then it is a bit cumbersome.
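For the resample case, a hedged sketch of how one might pull the per-fold retrafo out of the stored models. The field names (res$models, res$pred$instance$train.inds) follow mlr's ResampleResult, but check them against your version:

```r
lrn = cpoFilterUnivariate(perc = 0.5) %>>% makeLearner("classif.randomForest")
res = resample(lrn, pid.task, cv2, models = TRUE)  # keep the fitted models

# retrafo of the first fold's model; replaying that fold's *training
# subset* through it reconstructs what its randomForest saw:
retr1 = retrafo(res$models[[1]])
fold1.task = subsetTask(pid.task, res$pred$instance$train.inds[[1]])
fold1.indata = fold1.task %>>% retr1
```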
It is possible to save the data as it comes in to the control object of a CPO and retrieve it as the state of the retrafo, using the following custom CPO:
saver = makeCPO("saves", dataformat = "task",
cpo.train = { data }, # puts the data into the control object
cpo.retrafo = { data }) # i.e. a no-op
One can now:
> lrn = cpoFilterUnivariate(perc = 0.5) %>>% saver() %>>% makeLearner("classif.randomForest")
> model = train(lrn, pid.task)
> retr = retrafo(model)
> as.list(retr) # this is how we see the retrafo chain elements
[[1]]
CPO Retrafo chain
[RETRAFO univariate.model.score(perc = 0.5, abs = <NULL>, threshold = <NULL>, perf.learner =
<NULL>, perf.measure = <NULL>, perf.resampling = <NULL>)]
[[2]]
CPO Retrafo chain
[RETRAFO saves()]
> state = getCPOTrainedState(as.list(retr)[[2]]) # get the retrafo state
> names(state) # note that we want the **control**, not the data
[1] "control" "data"
> class(state$control)
[1] "ClassifTask" "SupervisedTask" "Task"
> # Verify that the control object is the same as what we get using the method above.
> # Note that FilterUnivariate is a stochastic method, so every time
> # you run 'train()' the result will be different.
> # The result of 'data %>>% retr' is deterministic, however (as it should be!)
> supposed.orig.indata = pid.task %>>% retr
> all.equal(supposed.orig.indata, state$control)
[1] TRUE
This does work with subsampling CPOs, but it does not work if other (non-CPO) wrappers are used.
This is not much better than the plain retrafo approach when resampling: the training models similarly need to be kept using models = TRUE.
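Combining the "saver" CPO with resample (again assuming models = TRUE, and the same ResampleResult field names as in mlr), one could collect, per fold, the task the learner actually saw; a sketch:

```r
lrn = cpoFilterUnivariate(perc = 0.5) %>>% saver() %>>%
  makeLearner("classif.randomForest")
res = resample(lrn, pid.task, cv2, models = TRUE)

# for each fold's model, walk to the saver's retrafo state and pull out
# the stored task (kept in $control, as shown above):
fold.indata = lapply(res$models, function(m) {
  getCPOTrainedState(as.list(retrafo(m))[[2]])$control
})
```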
It is also relatively painless to make a custom CPO that saves the data to a global environment variable:
savior = makeCPOExtendedTrafo("saves",
dataformat = "task",
cpo.trafo = { control = NULL; indata <<- data },
cpo.retrafo = { outdata <<- data })
# be aware that this works because the expression 'x <<- y' evaluates to the value 'y'.
We can now access the original indata in the global environment.
> lrn = cpoFilterUnivariate(perc = 0.5) %>>% savior() %>>% makeLearner("classif.randomForest")
> model = train(lrn, pid.task) # this writes the data into the "indata" variable
> class(indata)
[1] "ClassifTask" "SupervisedTask" "Task"
> retr = retrafo(model)
> supposed.orig.indata = pid.task %>>% retr
> all.equal(supposed.orig.indata, indata)
[1] TRUE
This can also work with resample, but you would have to extend the data-storing logic so that multiple calls do not overwrite the previous result (e.g. using indata <<- c(indata, list(data)) and initializing indata to NULL).
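A sketch of that extension. Note that cpo.trafo must still evaluate to the data, so once the list-append replaces the plain assignment, the trafo body needs an explicit final data expression:

```r
indata = NULL  # initialize the accumulator before training

savior = makeCPOExtendedTrafo("saves",
  dataformat = "task",
  cpo.trafo = {
    control = NULL
    indata <<- c(indata, list(data))  # append instead of overwrite
    data  # the trafo must still return the (unchanged) data
  },
  cpo.retrafo = { data })
```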
If you have an idea of how a good UI to get the data after preprocessing (or any information about preprocessing) should look like, please chime in.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
As explained in this SO thread, I would like to access the data after all preprocessing steps and before training (or even in between, for testing/debugging purposes). The use cases are briefly described in my comment there:
Would something like that make sense in general? Will mlrCPO support this feature? Thanks a lot!