HongxiangXu closed this issue 2 years ago
You would use mlr3filters here, in particular FilterSelectedFeatures. For a Filter, you have to specify the number of features, the fraction of features, or a cutoff value. However, the way FilterSelectedFeatures currently works is that it assigns a score of 1 to all features selected by a learner, so you should set the cutoff to 1; setting nfeat to some specific value would only add / remove random (?) features to this set of selected features, without being informed by the selection algorithm.
library('mlr3')
library('mlr3pipelines')
library('mlr3filters') # for flt() / FilterSelectedFeatures
library('mlr3learners') # for cv_glmnet
set.seed(3)
# Create FilterSelectedFeatures
f = flt("selected_features", learner = lrn("classif.cv_glmnet"))
# convert to PipeOp, set cutoff to 1
po_f = po(f, filter.cutoff = 1)
# build some arbitrary learner
gl = as_learner(po_f %>>% lrn("classif.log_reg"))
gl$train(tsk("sonar"))
# look at the created model: which features were retained and given to log_reg?
gl$graph_model$pipeops$classif.log_reg$learner_model$model$coefficients
#> (Intercept) V11 V12 V21 V22 V36
#> -4.4589291 3.6729705 3.2792197 0.7161011 1.4257925 -3.7313518
#> V4 V45 V49 V52
#> 6.7106697 7.8432847 12.2258373 51.0595149
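To see why a cutoff of 1 is the right setting, one can inspect the filter's scores directly (a quick sketch; the exact encoding of unselected features is an implementation detail of mlr3filters):

```r
library("mlr3")
library("mlr3filters")  # for flt() / FilterSelectedFeatures
library("mlr3learners") # for classif.cv_glmnet

set.seed(3)
f = flt("selected_features", learner = lrn("classif.cv_glmnet"))
f$calculate(tsk("sonar"))
# features selected by cv_glmnet score 1, all others score 0,
# so filter.cutoff = 1 keeps exactly the selected set
table(f$scores)
head(f$scores)
```

With only these two score values, any nfeat other than the size of the selected set must break ties among equally scored features, which is why the cutoff-based approach is preferable here.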
Another (advanced) way makes use of the selector (-> ?Selector) input of PipeOpSelect:
set.seed(3)
# selector function: <Task> --> <character>, indicating selected features
s = function(task) lrn("classif.cv_glmnet")$train(task)$selected_features()
# using PipeOpSelect
po_s = po("select", selector = s)
# the rest from here as before
gl = as_learner(po_s %>>% lrn("classif.log_reg"))
gl$train(tsk("sonar"))
gl$graph_model$pipeops$classif.log_reg$learner_model$model$coefficients
#> (Intercept) V11 V12 V21 V22 V36
#> -4.4589291 3.6729705 3.2792197 0.7161011 1.4257925 -3.7313518
#> V4 V45 V49 V52
#> 6.7106697 7.8432847 12.2258373 51.0595149
The s function is more versatile here; it is e.g. possible to pass a lambda argument to the $selected_features() call.
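As a sketch of that versatility, the selector could fix a specific lambda instead of the learner's default (the numeric value 0.05 here is arbitrary, chosen only for illustration; how string shortcuts like "lambda.min" are handled depends on the mlr3learners version):

```r
library("mlr3")
library("mlr3pipelines")
library("mlr3learners") # for classif.cv_glmnet

set.seed(3)
# selector that extracts features at an explicit (hypothetical) lambda value
s_lambda = function(task) {
  lrn("classif.cv_glmnet")$train(task)$selected_features(lambda = 0.05)
}
gl = as_learner(po("select", selector = s_lambda) %>>% lrn("classif.log_reg"))
gl$train(tsk("sonar"))
# inspect which features reached log_reg under this lambda
names(gl$graph_model$pipeops$classif.log_reg$learner_model$model$coefficients)
```

A larger lambda penalizes more heavily and generally retains fewer features, so this gives direct control over the size of the selected set.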
Thanks very much! I solved my problem following your code, and I think the second way you provided is more flexible.
How could I use cv_glmnet or other machine learning methods such as xgboost to filter features, so that I can reduce the feature count? It is a learner, so it cannot be added as a filter in the pipeline.
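One possible direction (a sketch, not from the thread above): learners that do not report selected features, such as xgboost, can still drive filtering through FilterImportance from mlr3filters, which scores features by the wrapped learner's importance measure. Because those scores are continuous, setting filter.nfeat is meaningful here, unlike with FilterSelectedFeatures. The nrounds and nfeat values below are arbitrary choices for illustration:

```r
library("mlr3")
library("mlr3pipelines")
library("mlr3filters")
library("mlr3learners") # for classif.xgboost

set.seed(3)
# score features by xgboost importance, keep the 10 highest-scoring ones
f_imp = flt("importance", learner = lrn("classif.xgboost", nrounds = 20))
po_imp = po(f_imp, filter.nfeat = 10)
gl = as_learner(po_imp %>>% lrn("classif.log_reg"))
gl$train(tsk("sonar"))
# which features were passed on to log_reg?
names(gl$graph_model$pipeops$classif.log_reg$learner_model$model$coefficients)
```

The same pattern works for any learner with the "importance" property; the filter trains the learner internally during $train() of the pipeline.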