mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

Access data between or after preprocessing steps #2034

Closed notiv closed 4 years ago

notiv commented 7 years ago

As explained in this SO thread, I would like to access the data after all preprocessing steps and before training (or even in between, for testing/debugging purposes). The use cases are briefly described in my comment there:

The reason why I'm using the wrappers (e.g. custom ones not mentioned above) is to consolidate the training and scoring code in one function while passing arguments between the two (as well as to perform hyperparameter tuning if necessary). However, testing/debugging the code within the "real" workflow is often as useful as using unit tests. And there are cases where a third package, e.g. in my case xgboostExplainer (medium.com/applied-data-science/…), requires the preprocessed training data.

Would something like that make sense in general? Will mlrCPO support this feature? Thanks a lot!

mb706 commented 6 years ago

About what mlrCPO supports and what is possible with it:

Standard way to get input data

Currently the easiest approach is to re-run the preprocessing on the original data as if it were new (prediction) data, using the "retrafo" stored in the model; this is supposed to always give the same result as the original preprocessing:

> lrn = cpoFilterUnivariate(perc = .5) %>>% makeLearner("classif.randomForest")
> model = train(lrn, pid.task)
> retr = retrafo(model)  # the trained 'retrafo' object
> orig.indata = pid.task %>>% retr
> head(getTaskData(orig.indata))
  pregnant glucose pressure insulin diabetes
1        6     148       72       0      pos
2        1      85       66       0      neg
3        8     183       64       0      pos
4        1      89       66      94      neg
5        0     137       40     168      pos
6        5     116       74       0      neg
> 

(Note: Unfortunately this does not work with subsampling CPOs, and it does not work if non-CPO preprocessing is done.)

(Also note: obviously you cannot just do pid.task %>>% cpoFilterUnivariate(perc = 0.5) and expect the same result, since some CPOs are stochastic.)

You can also do

newdata %>>% retr

to get what the underlying randomForest (in this case) model sees when you do

predict(model, newdata)

This works even when subsampling is done (since that only happens during training).

When resampling, this only works if the training models (and hence their retrafos) are kept via models = TRUE, and even then it is a bit cumbersome.
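
To illustrate, here is a minimal sketch of that route when resampling (the explicit ResampleInstance, the cv3 description and the variable names are illustrative assumptions, not something prescribed above):

library(mlr)
library(mlrCPO)

lrn = cpoFilterUnivariate(perc = 0.5) %>>% makeLearner("classif.randomForest")

# fix the splits up front so each fold's training indices are known
rin = makeResampleInstance(cv3, pid.task)
res = resample(lrn, pid.task, rin, models = TRUE)  # models = TRUE keeps the WrappedModels

# retrafo of the first fold's model, applied to that fold's training data
retr1 = retrafo(res$models[[1]])
fold1.indata = subsetTask(pid.task, rin$train.inds[[1]]) %>>% retr1
head(getTaskData(fold1.indata))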

Custom CPO that does nothing but save input data

It is possible to save the incoming data in the control object of a CPO and then retrieve it as the state of the retrafo, using the following custom CPO:

saver = makeCPO("saves", dataformat = "task",
  cpo.train = { data },  # puts the data into the control object
  cpo.retrafo = { data })  # i.e. a no-op

One can now:

> lrn = cpoFilterUnivariate(perc = 0.5) %>>% saver() %>>% makeLearner("classif.randomForest")
> model = train(lrn, pid.task)
> retr = retrafo(model)
> as.list(retr)  # this is how we see the retrafo chain elements
[[1]]
CPO Retrafo chain
[RETRAFO univariate.model.score(perc = 0.5, abs = <NULL>, threshold = <NULL>, perf.learner = 
<NULL>, perf.measure = <NULL>, perf.resampling = <NULL>)]

[[2]]
CPO Retrafo chain
[RETRAFO saves()]
> state = getCPOTrainedState(as.list(retr)[[2]])  # get the retrafo state
> names(state)  # note that we want the **control**, not the data
[1] "control" "data"
> class(state$control)
[1] "ClassifTask"    "SupervisedTask" "Task" 
> # Verify that the control object is the same as what we get using the method above.
> # Note that FilterUnivariate is a stochastic method, so every time
> # you run 'train()' the result will be different.
> # The result of 'data %>>% retr' is deterministic, however (as it should be!)
> supposed.orig.indata = pid.task %>>% retr
> all.equal(supposed.orig.indata, state$control)
[1] TRUE

This does work with subsampling CPOs, but it does not work if other (non-CPO) wrappers are used.

This works a bit better than the plain retrafo approach, but when resampling the training models similarly need to be kept using models = TRUE.
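
A minimal sketch of pulling the saved state out of a resample result, assuming the saver CPO defined above (cv3 and the variable names are illustrative):

lrn = cpoFilterUnivariate(perc = 0.5) %>>% saver() %>>% makeLearner("classif.randomForest")
res = resample(lrn, pid.task, cv3, models = TRUE)  # keep the per-fold models

# the saver is the second element of each fold's retrafo chain
retr.saver = as.list(retrafo(res$models[[1]]))[[2]]
fold1.indata = getCPOTrainedState(retr.saver)$control  # preprocessed training task of fold 1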

Hacky way of using global variables

It is also relatively painless to make a custom CPO that saves the data to a global environment variable:

savior = makeCPOExtendedTrafo("saves",
  dataformat = "task",
  cpo.trafo = { control = NULL; indata <<- data },
  cpo.retrafo = { outdata <<- data })
# be aware that this works because the expression 'x <<- y' evaluates to the value 'y'.

We can now access the original input data through the indata variable in the global environment:

> lrn = cpoFilterUnivariate(perc = 0.5) %>>% savior() %>>% makeLearner("classif.randomForest")
> model = train(lrn, pid.task)  # this writes the data into the "indata" variable
> class(indata)
[1] "ClassifTask"    "SupervisedTask" "Task"
> retr = retrafo(model)
> supposed.orig.indata = pid.task %>>% retr
> all.equal(supposed.orig.indata, indata)
[1] TRUE

This can also work with resample, but you would have to extend the logic of storing the data in the global variable so that later calls do not overwrite earlier ones (e.g. using indata <<- c(indata, list(data)) and initializing indata to NULL), as sketched below.
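
A minimal sketch of that accumulating variant (the saves.all name is made up for illustration, and it assumes sequential execution):

indata = NULL  # reset the accumulator before resampling

savior.all = makeCPOExtendedTrafo("saves.all",
  dataformat = "task",
  cpo.trafo = {
    control = NULL
    indata <<- c(indata, list(data))  # append this iteration's training data
    data  # return the data unchanged
  },
  cpo.retrafo = { data })

lrn = cpoFilterUnivariate(perc = 0.5) %>>% savior.all() %>>% makeLearner("classif.randomForest")
res = resample(lrn, pid.task, cv3)
length(indata)  # one saved task per resampling iteration (3 for cv3)
# note: this relies on sequential execution; parallel workers do not share the global environment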

PS

If you have an idea of what a good UI for getting the data after preprocessing (or any other information about the preprocessing) should look like, please chime in.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.