mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

Is there a way to fuse a learner with a *chain* of filter methods? #1386

Closed DrAndiLowe closed 4 years ago

DrAndiLowe commented 7 years ago

Hello,

I'd like to know if there is a way to fuse a learner with more than one filter method. The reason I want to do this is that different custom filters have different computational costs. I want to do multi-level cascade filtering: easy, fast feature rejection first, then slower, harder rejection in subsequent steps. The final filter is so computationally expensive that I can only run it on a small feature set, so I need to filter beforehand. The idea is to make progressively finer selections in a stepwise manner. It would look something like this:

lrn <- makeLearner("classif.glmnet", predict.type = "prob")
...
lrn <- makeFilterWrapper(lrn, fw.method = "Filt1", fw.threshold = 0.5)
lrn <- makeFilterWrapper(lrn, fw.method = "Filt2", fw.threshold = 0.5)
lrn <- makeFilterWrapper(lrn, fw.method = "Filt3", fw.threshold = 0.5)
lrn <- makePreprocWrapperCaret(lrn,
                               ppc.center = TRUE, 
                               ppc.scale = TRUE,
                               ppc.YeoJohnson = TRUE)
lrn <- makeFilterWrapper(lrn, fw.method = "Filt4", fw.threshold = 0.5)

Is this feasible? If so, how? I haven't been able to get the naive approach above to work. I have also tried:

lrn <- makeFilterWrapper(lrn, fw.method = c("zeroFilt","dupFilt"), fw.threshold = 0.5)

Which resulted in:

Assertion on 'fw.method' failed: Must be element of set {'anova.test','carscore','cforest.importance','chi.squared','dupFilt','gain.ratio','information.gain','kruskal.test','linear.correlation','mrmr','oneR','permutation.importance','rank.correlation','relief','rf.importance','rf.min.depth','symmetrical.uncertainty','univariate','univariate.model.score','variance','zeroFilt'}.

So this approach doesn't work either; it seems fw.method can only have length 1.

Any ideas?

With many thanks,

Andrew.

larskotthoff commented 7 years ago

Hmm, in principle the first method should work, but fails in practice because the parameters of the wrapped learners have the same names.

So no, unfortunately this isn't supported at the moment and adding support for it isn't entirely straightforward.

DrAndiLowe commented 7 years ago

OK, thanks Lars. The primary motivation for doing this is that there are feature selection algorithms I'd like to use that are just too slow on my hardware, but that become computationally feasible if I first reduce the feature set with a simpler algorithm that rejects features that are obviously not predictive. A secondary motivation, as alluded to in my first example, is that I want to do some feature selection before feature transformations, and then some more afterwards. My current workaround is to have one humongous feature selection function in a custom filter, and the first bit of feature selection can't be part of CV because I can't fuse it to a learner. (I'm making some hand-waving arguments that the information leakage is minimal, as these first selections are unsupervised and based only on the prevalence of unique values and NAs in the data -- basically a low-variance filter.)

It wasn't clear to me from the mlr documentation that I must include, in the pkg slot, the name of every package used by my custom filter; otherwise this error occurs:

Error in requireNamespace(pack, quietly = TRUE) :
  attempt to use zero-length variable name
Calls: ... mapply -> -> suppressor -> requireNamespace
In addition: There were 12 warnings (use warnings() to see them)
Tracing (structure(function () .... on entry

I see now that the pkg slot can hold a vector of package names. Perhaps this could be stated more prominently in the documentation? Might be helpful to others.
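For reference, a minimal custom-filter sketch showing pkg as a character vector. The filter name, the listed packages, and the scoring logic are all illustrative, not taken from my actual code:

```r
library(mlr)

# Hypothetical low-variance-style filter; names and logic are
# illustrative only.
makeFilter(
  name = "zeroFilt",
  desc = "Scores features by their fraction of unique values",
  pkg = c("data.table", "matrixStats"),  # pkg accepts a vector of package names
  supported.tasks = c("classif", "regr"),
  supported.features = c("numerics", "factors"),
  fun = function(task, nselect, ...) {
    data <- getTaskData(task, target.extra = TRUE)$data
    # Higher score = more distinct values = less likely to be near-constant;
    # makeFilter expects a named numeric vector of per-feature scores.
    vapply(data, function(x) length(unique(x)) / length(x), numeric(1))
  }
)
```

Once registered like this, the filter should be usable via fw.method = "zeroFilt" in makeFilterWrapper.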

Thanks again!

larskotthoff commented 7 years ago

Thanks, we'll have a look (and of course you're welcome to make a PR!). We'll also have a look at making custom filter chains easier, but no promises.

mb706 commented 5 years ago

This is possible with mlrCPO:

lrn <- cpoFilterFeatures(method = "variance", abs = 3, export = "abs", id = "outer") %>>%
  cpoFilterFeatures(method = "relief", abs = 2, export = "abs", id = "inner") %>>%
  makeLearner("classif.glmnet")

This creates a classif.glmnet learner that filters features first by "variance" and then by "relief". Note the id parameter, which prevents parameter name collisions, and the export parameter, which ensures that only the abs parameter of each filter (i.e. the number of features to select) is exported (this could instead be perc or threshold; whichever parameter is exported must also be the one given a value during construction). "Exporting" makes the parameter available for tuning:

> getParamSet(lrn)
                          Type  len     Def                 Constr Req Tunable Trafo
outer.abs              integer    -  <NULL>               0 to Inf   -    TRUE     -
inner.abs              integer    -  <NULL>               0 to Inf   -    TRUE     -
alpha                  numeric    -       1                 0 to 1   -    TRUE     -
[...]

> getHyperPars(lrn)
$s
[1] 0.01

$outer.abs
[1] 3

$inner.abs
[1] 2
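Because outer.abs and inner.abs appear in the learner's parameter set, they can be tuned like any other hyperparameter. A sketch, with bounds and budget made up purely for illustration (note the ranges keep inner.abs below outer.abs, so the inner filter never asks for more features than remain):

```r
library(mlr)
library(mlrCPO)

# Tune the number of features kept by each filter stage.
ps <- makeParamSet(
  makeIntegerParam("outer.abs", lower = 3, upper = 6),
  makeIntegerParam("inner.abs", lower = 1, upper = 3)
)
res <- tuneParams(lrn, pid.task, cv3, par.set = ps,
                  control = makeTuneControlRandom(maxit = 10))
```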

A trained model can be inspected using retrafo() and getCPOTrainedState(). It is necessary to use as.list() to index into the chain of retrafos to select the one to inspect. The following example inspects the outer CPO and finds out which three features were selected:

> model <- train(lrn, pid.task)
> model
Model for learner.id=classif.glmnet.filterFeatures.filterFeatures; learner.class=CPOLearner
Trained on: task.id = PimaIndiansDiabetes-example; obs = 768; features = 8
Hyperparameters: s=0.01,outer.abs=3,inner.abs=2
> retrafo(model)
CPO Retrafo chain
[RETRAFO filterFeatures(abs = 3)] =>
[RETRAFO filterFeatures(abs = 2)]
> getCPOTrainedState(as.list(retrafo(model))[[1]])
$abs
[1] 3

$method
[1] "variance"

$fval
NULL

$perc
NULL

$threshold
NULL

$filter.args
list()

$control
[1] "glucose"  "pressure" "insulin" 

[...]
stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.