Closed: ZWael closed this issue 6 months ago
Same result with some code modifications intended to fit the model only once:
final_mod <- finalize_workflow(wf, L1) %>%
  fit(train_d)
vars1 <- vip::vi(final_mod %>% extract_fit_parsnip(), lambda = L1$penalty)
vars2 <- vip::vi(final_mod %>% extract_fit_parsnip(), lambda = L2$penalty)
v1 <- vars1 %>% filter(Importance != 0) %>% pull(Variable)
v2 <- vars2 %>% filter(Importance != 0) %>% pull(Variable)
table(v1 %in% v2)
#>
#> FALSE TRUE
#> 2 25
Created on 2024-05-24 with reprex v2.1.0
If I am not mistaken, the variables selected with the higher penalty should be included among the variables selected with the lower penalty.
This is a reasonable hypothesis, but it turns out not to be true! The path of coefficient estimates as the penalty varies does not guarantee that the variables included at a higher penalty will persist at a lower penalty; the selection process is not strictly nested. The set of variables with non-zero coefficients changes due to the interplay of penalty strength, correlation among variables, and the optimization algorithm.
Here's an example with glmnet itself, no tidymodels involved:
library(dplyr)
library(vip)
library(glmnet) # for the toy dataset
#> Loading required package: Matrix
#>
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#>
#> expand, pack, unpack
#> Loaded glmnet 4.1-8
data(QuickStartExample) # glmnet toy data
set.seed(1234)
# we add some vars and noise to simulate data with p >> n
x <- replicate(8, {
  QuickStartExample$x + rnorm(2000)
})
QuickStartExample$x <- cbind(
  QuickStartExample$x,
  x[, , 1], x[, , 2], x[, , 3], x[, , 4],
  x[, , 5], x[, , 6], x[, , 7], x[, , 8]
)
glmnet_fit <- glmnet(QuickStartExample$x, QuickStartExample$y, alpha = 1)
vars1 <- vip::vi(glmnet_fit, lambda = 0.2) %>% filter(Importance != 0) %>% pull(Variable)
vars2 <- vip::vi(glmnet_fit, lambda = 0.1) %>% filter(Importance != 0) %>% pull(Variable)
table(vars1 %in% vars2)
#>
#> FALSE TRUE
#> 2 15
Created on 2024-05-24 with reprex v2.1.0
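To see the non-nested behavior directly, here is a rough sketch (reusing the `glmnet_fit` object from the chunk above) that walks the fitted coefficient path with `coef()` and counts how often a variable that is active at a larger lambda drops out at the next smaller one. glmnet orders its lambda sequence from largest to smallest, so consecutive columns compare a stronger penalty against a weaker one.

```r
beta <- as.matrix(coef(glmnet_fit))[-1, ]  # p x n_lambda coefficient path, intercept dropped
# active set (indices of non-zero coefficients) at each lambda along the path
active <- lapply(seq_len(ncol(beta)), function(j) which(beta[, j] != 0))
# count variables active at a larger lambda that are gone at the next smaller lambda
drops <- sum(vapply(
  seq_len(length(active) - 1),
  function(i) length(setdiff(active[[i]], active[[i + 1]])),
  integer(1)
))
drops
```

If the path were strictly nested, `drops` would be 0; any positive count marks a variable leaving the model as the penalty decreases.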
Thank you for the feedback.
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
I was trying to use glmnet for feature selection.
I used the code below, inspired by Julia Silge's blog post (many thanks to her): https://juliasilge.com/blog/lasso-the-office/. I was selecting variables with two different values of lambda. If I am not mistaken, the variables selected with the higher penalty should be included among those selected with the lower penalty, which isn't the case in my data, and the behavior is reproduced in the toy data below.
Shouldn't the final model produced by finalize_workflow() no longer vary?
Created on 2024-05-24 with reprex v2.1.0