Open mb706 opened 3 years ago
If we want to do something based on the variance inflation factor (VIF), we could probably integrate this as a filter, i.e., using the negative VIF as the filter score:
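library(mlr3)
library(mlr3learners)   # provides lrn("regr.lm")
library(mlr3filters)    # provides flt()
library(mlr3pipelines)  # provides po() and %>>%
task = tsk("mtcars")
filter = flt("vic")
# keep only features whose filter score (negative VIF) is above -10
g = po("filter", filter, filter.cutoff = -10) %>>% lrn("regr.lm")
l = lrn("regr.lm")
bg = benchmark_grid(task, list(g, l), rsmp("cv"))
b = benchmark(bg)
b$aggregate()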
   nr      resample_result task_id  learner_id resampling_id iters  regr.mse
1:  1 <ResampleResult[21]>  mtcars vic.regr.lm            cv    10  9.091071
2:  2 <ResampleResult[21]>  mtcars     regr.lm            cv    10 13.299961
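For reference, the textbook VIF of a feature is 1 / (1 - R^2), where R^2 comes from regressing that feature on all the other features. The following is a minimal base-R sketch of that definition, not necessarily how the "vic" filter above computes its scores; `vif_scores` is a hypothetical helper name introduced only for illustration.

```r
# Hypothetical helper: textbook VIF for every column of a numeric data.frame.
# Each feature is regressed on all the others; VIF = 1 / (1 - R^2).
vif_scores = function(features) {
  sapply(names(features), function(nm) {
    others = setdiff(names(features), nm)
    fit = lm(reformulate(others, response = nm), data = features)
    1 / (1 - summary(fit)$r.squared)
  })
}

# mtcars features without the target column "mpg"
features = mtcars[, setdiff(names(mtcars), "mpg")]
scores = vif_scores(features)

# Negate so that larger values mean "less collinear", matching the
# filter.cutoff = -10 convention used in the pipeline above.
neg_vif = sort(-scores, decreasing = TRUE)
kept = names(neg_vif)[neg_vif > -10]
```

With this convention, a cutoff of -10 keeps exactly the features whose VIF is below 10.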
Making this available as a filter probably makes sense. I don't know VIF well, but it looks like it depends on the task target, while there should be something useful even without taking the target into consideration. An example use case is a PipeOpLearnerCV that outputs probabilities, where one probability column is often just 1 - sum(other probabilities), which leads to warnings if this is the input to a simple linear model. The filter would probably go from left to right through the task features and measure how collinear each one is with the features already seen. The filter value would then be similar to a tolerance, so that features that are almost, but not exactly, collinear are also excluded (these could still lead to instability in some models).
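The left-to-right idea could be sketched roughly as follows, in base R and independent of any task target. This is only an illustration under stated assumptions: `collinearity_tolerance` is a hypothetical helper (not an existing mlr3 function), the score is taken to be 1 - R^2 of each column regressed on the columns before it, and the threshold `1e-8` is an arbitrary choice.

```r
# Hypothetical helper: score each column by its tolerance (1 - R^2) with
# respect to the columns to its left. A score near 0 means the column is
# (almost) a linear combination of the columns already seen.
collinearity_tolerance = function(features) {
  tol = setNames(numeric(ncol(features)), names(features))
  tol[1] = 1  # the first column has nothing to be collinear with
  for (j in seq_along(features)[-1]) {
    y = features[[j]]
    fit = lm(y ~ ., data = features[seq_len(j - 1)])
    rss = sum(residuals(fit)^2)
    tss = sum((y - mean(y))^2)
    tol[j] = rss / tss  # equals 1 - R^2
  }
  tol
}

# Simulated class probabilities: each row sums to 1, so the last column
# is exactly 1 - sum(other probabilities).
set.seed(1)
p = matrix(runif(60), ncol = 3)
p = as.data.frame(p / rowSums(p))
names(p) = c("prob.a", "prob.b", "prob.c")

tol = collinearity_tolerance(p)
keep = names(tol)[tol > 1e-8]  # assumed cutoff; raise it to drop near-collinear columns too
```

Here the redundant last probability column gets a tolerance of essentially zero and is dropped, while the first two columns are kept. Because the scores depend on column order, a real filter would have to document or fix that order.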
Somehow automatically recognize when a column is close to collinear with another column and drop it; this could be useful for linear models.