mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0

How can a learner such as xgboost or cv_glmnet act as a filter in a pipeline? #675

Closed HongxiangXu closed 2 years ago

HongxiangXu commented 2 years ago

How could I use cv_glmnet or other machine learning methods such as xgboost to filter features, so that I can reduce the number of features? They are learners, so they cannot be added as filters in a pipeline.

mb706 commented 2 years ago

You would use mlr3filters here, in particular FilterSelectedFeatures. For a Filter, you have to specify the number of features, the fraction of features, or a cutoff value. However, the way FilterSelectedFeatures currently works is that it assigns a score of 1 to every feature selected by the learner, so you should set the cutoff to 1; setting nfeat to some specific value would only add or remove (seemingly random) features from this set of selected features, without being informed by the selection algorithm.

library('mlr3')
library('mlr3pipelines')
library('mlr3filters')   # for flt() / FilterSelectedFeatures
library('mlr3learners')  # for cv_glmnet
set.seed(3)
# Create FilterSelectedFeatures
f = flt("selected_features", learner = lrn("classif.cv_glmnet"))
# convert to PipeOp, set cutoff to 1
po_f = po(f, filter.cutoff = 1)
# build some arbitrary learner
gl = as_learner(po_f %>>% lrn("classif.log_reg"))
gl$train(tsk("sonar"))

# look at the created model: which features were retained and given to log_reg?
gl$graph_model$pipeops$classif.log_reg$learner_model$model$coefficients 
#> (Intercept)         V11         V12         V21         V22         V36 
#>  -4.4589291   3.6729705   3.2792197   0.7161011   1.4257925  -3.7313518 
#>          V4         V45         V49         V52 
#>   6.7106697   7.8432847  12.2258373  51.0595149 
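Because gl is an ordinary GraphLearner, the cv_glmnet-based selection is refit inside every resampling fold, so it can be resampled or benchmarked like any other learner. A minimal sketch (not part of the example above; 3-fold CV chosen arbitrarily):

# resample the whole graph; the feature filter is re-trained per fold
rr = resample(tsk("sonar"), gl, rsmp("cv", folds = 3))
rr$aggregate(msr("classif.ce"))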

Another (advanced) way makes use of the selector (see ?Selector) argument of PipeOpSelect:

set.seed(3)
# selector function: <Task> --> <character>, indicating selected features
s = function(task) lrn("classif.cv_glmnet")$train(task)$selected_features()
# using PipeOpSelect
po_s = po("select", selector = s)

# the rest from here as before
gl = as_learner(po_s %>>% lrn("classif.log_reg"))
gl$train(tsk("sonar"))
gl$graph_model$pipeops$classif.log_reg$learner_model$model$coefficients
#> (Intercept)         V11         V12         V21         V22         V36 
#>  -4.4589291   3.6729705   3.2792197   0.7161011   1.4257925  -3.7313518 
#>          V4         V45         V49         V52 
#>   6.7106697   7.8432847  12.2258373  51.0595149 

The s function is more versatile here; it is, for example, possible to pass a lambda argument to the $selected_features() call.
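For instance, a sketch of such a selector that extracts the features with nonzero coefficients at a stricter lambda (here lambda.1se from the fitted cv.glmnet model; the names s_1se and po_s_1se as well as the particular lambda choice are illustrative):

# selector using a specific lambda: keep features selected at lambda.1se
# instead of the learner's default lambda
s_1se = function(task) {
  l = lrn("classif.cv_glmnet")$train(task)
  l$selected_features(lambda = l$model$lambda.1se)
}
po_s_1se = po("select", selector = s_1se)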

HongxiangXu commented 2 years ago

Thanks very much! I solved my problem by following your code, and I think the second way you provided is more flexible.