mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0

can we create a pipeopdropcollinear #572

Open mb706 opened 3 years ago

mb706 commented 3 years ago

Somehow automatically recognize when a column is close to collinear with another column and drop it; this could be useful for linear models.

sumny commented 3 years ago

If we want to do something based on the variance inflation factor (VIF), we could probably integrate this as a filter, i.e., using the negative VIF as the filter score, so that features with a lower VIF get a higher score:

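library(mlr3)
library(mlr3learners)   # provides lrn("regr.lm")
library(mlr3filters)
library(mlr3pipelines)

task = tsk("mtcars")

# assumes a filter registered under the id "vic" that scores features by their negative VIF
filter = flt("vic")
# keep only features whose score is above -10, then fit a linear model
g = po("filter", filter, filter.cutoff = -10) %>>% lrn("regr.lm")
l = lrn("regr.lm")

# compare the filtered pipeline against the plain linear model using cross-validation
bg = benchmark_grid(task, list(g, l), rsmp("cv"))
b = benchmark(bg)
b$aggregate()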
   nr      resample_result task_id  learner_id resampling_id iters  regr.mse
1:  1 <ResampleResult[21]>  mtcars vic.regr.lm            cv    10  9.091071
2:  2 <ResampleResult[21]>  mtcars     regr.lm            cv    10 13.299961
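
For reference, the score behind such a filter can be sketched in base R. This is only an illustration under stated assumptions, not the filter used above: `vif_scores()` is a hypothetical helper name, and the standard definition VIF_j = 1 / (1 - R_j^2) is assumed, where R_j^2 comes from regressing feature j on all other features. The computation only looks at the feature columns.

```r
# Hypothetical helper: variance inflation factor of every feature.
# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing feature j on all other features.
vif_scores = function(features) {
  features = as.data.frame(features)
  vapply(names(features), function(col) {
    others = setdiff(names(features), col)
    fit = lm(reformulate(others, response = col), data = features)
    1 / (1 - summary(fit)$r.squared)
  }, numeric(1))
}

# feature columns of mtcars, i.e. everything except the target "mpg"
features = mtcars[, setdiff(names(mtcars), "mpg")]
scores = vif_scores(features)

# negate so that a higher score means "less collinear", as a filter would expect
neg_vif = -scores
sort(neg_vif, decreasing = TRUE)

# features that would survive a cutoff of -10
names(neg_vif)[neg_vif > -10]
```

With this convention, a cutoff of -10 corresponds to dropping features whose VIF is 10 or higher.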

mb706 commented 3 years ago

Making this available as a filter probably makes sense. I don't know VIF, but it looks like it depends on the task target, while there should be something useful even without taking the target into consideration. An example use case is a PipeOpLearnerCV that outputs probabilities, where one probability column is often just 1 - sum(other probabilities); this leads to warnings if it is the input to a simple linear model. The filter would probably go from left to right through the task features and measure how collinear each feature is with the features already seen. The filter value would then be similar to a tolerance, so that slightly non-collinear features are also excluded (since they could still lead to instability in some models).
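
The left-to-right idea could look roughly like the base R sketch below. It is an illustration only, under these assumptions: `sequential_tolerance()` and its `cutoff` argument are hypothetical names, and "tolerance" is taken to be 1 - R^2 from regressing a feature on the features kept so far (with an intercept). It does not use the target at all.

```r
# Hypothetical helper: walk through the features from left to right and compute,
# for each one, the share of its variance NOT explained by the features kept so far.
sequential_tolerance = function(features, cutoff = 1e-6) {
  features = as.data.frame(features)
  kept = character(0)
  tol = setNames(numeric(ncol(features)), names(features))
  for (col in names(features)) {
    y = features[[col]]
    tss = sum((y - mean(y))^2)
    if (tss == 0) {
      # a constant column is collinear with the intercept
      tol[[col]] = 0
    } else {
      # design matrix: intercept plus all features kept so far
      x = cbind(intercept = rep(1, nrow(features)), as.matrix(features[kept]))
      rss = sum(lm.fit(x = x, y = y)$residuals^2)
      tol[[col]] = rss / tss  # equals 1 - R^2
    }
    # only features that are sufficiently non-collinear are kept
    if (tol[[col]] > cutoff) kept = c(kept, col)
  }
  list(tolerance = tol, kept = kept)
}

# simulated class probabilities: each row sums to 1, so the last column
# is 1 - sum(other probabilities)
set.seed(1)
raw = matrix(rexp(300), ncol = 3)
probs = as.data.frame(raw / rowSums(raw))
names(probs) = c("prob.a", "prob.b", "prob.c")

res = sequential_tolerance(probs)
res$tolerance
res$kept
```

A larger `cutoff` would also drop features that are only nearly collinear. Note that the result depends on the column order, since each feature is only compared against the features to its left.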