juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

LASSO regression using tidymodels and #TidyTuesday data for The Office | Julia Silge #8

utterances-bot opened this issue 3 years ago

utterances-bot commented 3 years ago

LASSO regression using tidymodels and #TidyTuesday data for The Office | Julia Silge

I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using this week’s #TidyTuesday dataset on The Office to show how to build a lasso regression model and choose regularization parameters!

https://juliasilge.com/blog/lasso-the-office/
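For readers skimming the thread, the core model setup discussed below looks roughly like this (a sketch only; the full recipe, resampling, and tuning workflow are in the post itself):

```r
library(tidymodels)

# The lasso is a linear model with a pure L1 penalty: mixture = 1 selects the
# lasso, and penalty = tune() leaves the amount of regularization to be tuned.
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

# 50 candidate penalty values, spaced on dials' default log10 scale
lambda_grid <- grid_regular(penalty(), levels = 50)
```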

juliasilge commented 11 months ago

@Pancreatologist Using variable importance to select features is definitely something that folks do! However, it's important to note that what you are doing here is supervised feature selection and you need to be very careful in how you "spend your data budget" for this to avoid overfitting. Check out this chapter of Max's book with Kjell Johnson on this topic.

Pancreatologist commented 8 months ago

Hi Julia, thanks a lot. I think the reason my results are unsatisfactory (low AUC on the test set) is that my data budget is not sufficient. As the chapter says, the lasso is vulnerable to correlated predictors, so I need to remove them, but I don't know how. Should I run a correlation analysis, set a cutoff like 0.9, and remove the predictors above that value? Or can I just use step_corr(all_predictors())? I know my question may be too basic, but could you please give me further suggestions? Thanks again.

juliasilge commented 8 months ago

@Pancreatologist Yes, step_corr() can be used to remove correlated predictors.
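A minimal sketch of what that might look like inside a recipe (office_train and imdb_rating here just echo the post's example data and stand in for your own; threshold = 0.9 is an illustrative cutoff and also happens to be step_corr()'s default):

```r
library(tidymodels)

# Drop one member of each pair of highly correlated numeric predictors
# before fitting, so the lasso isn't asked to choose among near-duplicates.
office_rec <- recipe(imdb_rating ~ ., data = office_train) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9)
```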

gac813 commented 8 months ago

Not sure if this has been mentioned, but wouldn't using the 'playoffs' variable be considered "leaky", since that information is only determined after the regular season has ended and is not available at the time of the prediction?

FLafont commented 3 weeks ago

Thank you for the tutorial. Is there a reason why lambda_grid <- grid_regular(penalty(), levels = 50) yields penalty values that max out at 1? What if the optimal lambda is higher in our case? How would you recommend testing this possibility in the tidymodels framework? Thanks a lot in advance.

juliasilge commented 3 weeks ago

@FLafont That's the default here for penalty(): https://dials.tidymodels.org/reference/penalty.html

That's a pretty good set of defaults for many situations, but you can absolutely set the range differently. I typically do this if my hyperparameter tuning results show that I haven't quite caught the optimal value for the penalization.
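Concretely, penalty() is parameterized on the log10 scale with a default range of c(-10, 0), which is why the grid tops out at 10^0 = 1. A sketch of widening the range (the upper bound of 1.5 here is an arbitrary choice for illustration):

```r
library(tidymodels)

# range = c(-3, 1.5) on the log10 scale means candidate lambdas run from
# 1e-3 up to 10^1.5 (about 31.6) instead of stopping at 1.
lambda_grid_wide <- grid_regular(penalty(range = c(-3, 1.5)), levels = 50)
range(lambda_grid_wide$penalty)
```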

FLafont commented 3 weeks ago

I see, thanks. What is strange in my case is that although I get a "reasonable" result when I cap lambda at 2, with the best lambda at 1.24, I still get slight improvements in RMSE when lambda goes up to 15.

One thing that could help me decide is to look at which variables are kept by the model at the different lambda values, but I don't know how to do this with tidymodels except by using library(vip) as you did in the blog post. Is there a more built-in way to extract the coefficients or the variables directly from the last workflow? Thanks a lot.

juliasilge commented 3 weeks ago

@FLafont you can either use vip or you can use any built-in method for the underlying glmnet object, via extract_fit_engine(). These extractors help you get out the underlying components of the tidymodels object so you can manipulate them as needed. Also you might want to check out https://www.tidymodels.org/learn/models/coefficients/#more-complex-a-glmnet-model.
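A rough sketch of both routes, assuming final_fit is a fitted workflow (for example from fit(), or pulled out of a last_fit() result with extract_workflow()); the s = 1.24 value below just echoes the lambda mentioned above:

```r
library(tidymodels)

# Tidied coefficients at the penalty the workflow was finalized with
final_fit %>%
  extract_fit_parsnip() %>%
  tidy()

# Or reach the underlying glmnet object and use any of its own methods,
# e.g. coefficients at a specific lambda value
final_fit %>%
  extract_fit_engine() %>%
  coef(s = 1.24)
```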