juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

LASSO regression using tidymodels and #TidyTuesday data for The Office | Julia Silge #8

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

LASSO regression using tidymodels and #TidyTuesday data for The Office | Julia Silge

I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using this week’s #TidyTuesday dataset on The Office to show how to build a lasso regression model and choose regularization parameters!

https://juliasilge.com/blog/lasso-the-office/

duccioa commented 3 years ago

Hello, thank you for this nice tutorial. It's very clear and useful. One question: you define office_prep but then it is not used. Where would you use an object produced by the function prep()?

juliasilge commented 3 years ago

@duccioa You typically don't need to use prep() or bake() if you are bundling together a model and recipe in a workflow because it takes care of it under the hood, but it is good to know how to use prep() as an exploration/debugging tool. Data preprocessing doesn't always go the way you expect, so being able to do it yourself and see the output is helpful. You can read more about using recipes here.
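For example, a minimal sketch of using prep() as an exploration tool, assuming the office_rec recipe and office_train data from this post (with tidymodels loaded; treat the names as illustrative):

office_prep <- prep(office_rec, training = office_train)
office_prep        # printing the trained recipe shows what each step did and to which columns
tidy(office_prep)  # one row per step, handy for checking the preprocessing plan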

duccioa commented 3 years ago

@juliasilge Thank you for your answer !

albert-ying commented 3 years ago

Hi Julia, I'm a little bit lost about tuning the penalty. Based on glmnet and this GitHub issue https://github.com/tidymodels/parsnip/issues/195, it seems that glmnet automatically identifies the optimal lambda. So I'm wondering whether I need to tune the penalty?

juliasilge commented 3 years ago

You can read more about the underlying glmnet model here; notice that it does fit a whole range of regularization penalties when you fit() to data, but you need to use something like cross-validation to pick the best value. So yes ✅ you do need to tune the regularization penalty for glmnet.
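As a minimal sketch of what that tuning looks like, using object names patterned after this post (office_boot resamples and a workflow wf that already has the recipe added; treat them as illustrative, with tidymodels loaded):

tune_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

lambda_grid <- grid_regular(penalty(), levels = 50)

lasso_grid <- tune_grid(
  wf %>% add_model(tune_spec),
  resamples = office_boot,
  grid = lambda_grid
)

lowest_rmse <- select_best(lasso_grid, metric = "rmse")  # the penalty with the best resampled RMSE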

mkrasmus commented 3 years ago

Hi Julia. Thank you for this example; it is helping me a lot. I have been looking into the benefits of the relaxed lasso approach and was wondering whether you think it is appropriate, while staying within a tidymodels workflow, to have the first lasso identify predictors to be removed for the second lasso's recipe? So effectively just updating the recipe for the workflow and repeating the fitting with another lasso...

juliasilge commented 3 years ago

@mkrasmus I have not worked through this myself, but I do think this would work. I'm more familiar with this definition of relaxed lasso, where you use lasso to do feature selection and then refit those features with no penalization (so penalty = 0 in tidymodels). I'd be super interested in seeing how you set this up!
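A very rough sketch of that second step, assuming you have already pulled the lasso's surviving predictors into a character vector selected_vars, and with my_train and outcome as placeholders for your data and outcome column:

relaxed_rec <- recipe(outcome ~ ., data = select(my_train, outcome, all_of(selected_vars)))

relaxed_spec <- linear_reg(penalty = 0, mixture = 1) %>%  # lasso structure, no penalization
  set_engine("glmnet")

relaxed_wf <- workflow() %>%
  add_recipe(relaxed_rec) %>%
  add_model(relaxed_spec)

relaxed_fit <- fit(relaxed_wf, data = my_train)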

mandarpriya commented 3 years ago

Hi Ma'am, I was trying to replicate the code and encountered a problem with rf_res: Error: The resamples argument should be an 'rset' object, such as the type produced by vfold_cv() or other 'rsample' functions. In addition: Warning message: The ... are not used in this function but one or more objects were passed:

So do I need to change the argument, or is something else the issue?

juliasilge commented 3 years ago

@mandarpriya Hmmmm, this blog post does not create anything called rf_res; maybe you are looking at the wrong post?

scottlyden commented 3 years ago

Hi, Julia. I'm in the super early stages of trying to grasp tidymodels, and this post feels like a good introduction. Thanks for all the work you do on tidymodels and on your blog.

The block of code that plots the lasso_grid metrics crashes my RStudio session reliably. I don't see anything strange about that code snippet, but I am having trouble diagnosing the problem since the session just crashes.

Do you spot anything that might have changed syntactically since you posted this?

Thanks for any help, and my sessionInfo is below ...

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] glmnet_4.1-1 Matrix_1.2-18 forcats_0.5.1 stringr_1.4.0 readr_1.4.0
[6] tidyverse_1.3.1 yardstick_0.0.8 workflowsets_0.0.2 workflows_0.2.2 tune_0.1.4
[11] tidyr_1.1.3 tibble_3.1.0 rsample_0.0.9 recipes_0.1.16 purrr_0.3.4
[16] parsnip_0.1.5 modeldata_0.1.0 infer_0.5.4 ggplot2_3.3.3 dplyr_1.0.5
[21] dials_0.0.9 scales_1.1.1 broom_0.7.6 tidymodels_0.1.3

loaded via a namespace (and not attached): [1] fs_1.5.0 lubridate_1.7.10 doParallel_1.0.16 DiceDesign_1.9 httr_1.4.2
[6] tools_4.0.3 backports_1.2.1 utf8_1.2.1 R6_2.5.0 rpart_4.1-15
[11] DBI_1.1.1 colorspace_2.0-1 nnet_7.3-14 withr_2.4.2 tidyselect_1.1.0
[16] curl_4.3 compiler_4.0.3 cli_2.5.0 rvest_1.0.0 xml2_1.3.2
[21] labeling_0.4.2 digest_0.6.27 pkgconfig_2.0.3 parallelly_1.24.0 lhs_1.1.1
[26] dbplyr_2.1.1 rlang_0.4.10 readxl_1.3.1 rstudioapi_0.13 shape_1.4.5
[31] farver_2.1.0 generics_0.1.0 jsonlite_1.7.2 magrittr_2.0.1 Rcpp_1.0.6
[36] munsell_0.5.0 fansi_0.4.2 GPfit_1.0-8 lifecycle_1.0.0 furrr_0.2.2
[41] stringi_1.5.3 pROC_1.17.0.1 snakecase_0.11.0 MASS_7.3-53 plyr_1.8.6
[46] grid_4.0.3 parallel_4.0.3 listenv_0.8.0 crayon_1.4.1 lattice_0.20-41
[51] haven_2.4.0 splines_4.0.3 hms_1.0.0 pillar_1.6.0 codetools_0.2-16
[56] reprex_2.0.0 glue_1.4.2 modelr_0.1.8 vctrs_0.3.7 foreach_1.5.1
[61] schrute_0.2.2 cellranger_1.1.0 gtable_0.3.0 future_1.21.0 assertthat_0.2.1
[66] gower_0.2.2 janitor_2.1.0 prodlim_2019.11.13 class_7.3-17 survival_3.2-7
[71] timeDate_3043.102 iterators_1.0.13 hardhat_0.1.5 lava_1.6.9 globals_0.14.0
[76] ellipsis_0.3.1 ipred_0.9-11

juliasilge commented 3 years ago

@scottlyden Wow, that sounds super frustrating. 😩 I would take a look at collect_metrics(lasso_grid) to see if anything looks strange there, and also try autoplot(lasso_grid) to see if that crashes. It looks like your R is pretty up-to-date, so I assume your RStudio is as well? I just re-ran the code from this blog post and it all ran without crashing (although there are some differences now); I certainly would not expect anything to cause R to crash, especially when plotting. I might try running the chunk that crashes line by line (first line 1, then lines 1-2, then lines 1-3, etc.) to see what exactly is causing the crash?
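Concretely, the two checks mentioned above would be (assuming your tuning results are in lasso_grid):

collect_metrics(lasso_grid)  # should be a small tibble of penalty values and metrics
autoplot(lasso_grid)         # tune's built-in plot of the metrics across penalty values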

rpgately commented 3 years ago

Hi Julia,

Thanks so much for all the tutorials/blogs/YouTube videos - I think they are such a great source of information and learning. I am doing a PhD in medicine and, with my dataset, I am trying to predict an outcome (rare disease) using LASSO, following your code above (in addition to your blog on the Himalayas and class imbalance). As I understand it, feature selection is an intrinsic part of LASSO - is there any way of easily extracting which variables actually are selected? Presumably any variable that is included in your VIP graph above is selected by LASSO? It's a bit difficult to interpret, though, as each row of the VIP graph seems to be a level within a variable (e.g. character Kim, Michael etc.).

Also, just wondering, is there any easy way to alter the metrics that are given by the collect_metrics() function? For some reason accuracy and ROC AUC are generated when I use my data, instead of the RMSE that you got above, perhaps due to the fact that I am using logistic regression instead of linear regression?

Again thanks so much, enjoy the coffee!!!

R

juliasilge commented 3 years ago

@rpgately Yes, you can definitely extract the variables that the lasso algorithm selected. This chapter walks through this pretty thoroughly; look for the part where you tidy() the lasso model output.
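A minimal sketch of that, with final_fit as an illustrative name for your fitted workflow:

final_fit %>%
  pull_workflow_fit() %>%
  tidy() %>%            # one row per term with its lasso coefficient
  filter(estimate != 0) # terms with nonzero coefficients are the ones the lasso kept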

You can set any of a multitude of metrics that are appropriate for your model, using metric_set(). You can read here about using metrics to judge model effectiveness and here about setting metrics during tuning or fitting.
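For instance, a sketch of swapping in classification metrics during tuning (object names such as wf, folds, and lambda_grid are illustrative):

my_metrics <- metric_set(roc_auc, accuracy, sensitivity, specificity)

lasso_grid <- tune_grid(
  wf,
  resamples = folds,
  grid = lambda_grid,
  metrics = my_metrics
)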

kamaulindhardt commented 3 years ago

Hi Julia,

Thank you for this detailed LASSO regression modelling. How would you recommend comparing model performance for an exercise where I have: 1) a linear (glmnet) model, 2) a LASSO linear model, 3) a random forest model, and 4) an XGBoost model, all predicting logRR (response ratios) from a large dataset with more than 40 variables and 10000 rows/observations? I found something here: https://www.kirenz.com/post/2021-02-17-r-classification-tidymodels/#compare-models but he does not really explain what metrics are most appropriate to use (especially when I also have "simpler" linear models), or what visualisation options are possible when using tidy.. What metrics would you use?

Any good tips and suggestions are welcomed. Thank you!

rpgately commented 3 years ago

Thanks so much Julia that's a great help!

juliasilge commented 3 years ago

@kamaulindhardt I'm not totally sure what kind of model analysis you have here, since you say you are after a response ratio, which I don't think naturally falls out of a tree-based model (I think you would need a partial dependence plot or similar?). However, all that aside, the thing to do is to choose metrics appropriate to your problem, either for regression or classification. It doesn't matter that some of your models are linear and some are more complex; you can compare them using the same metrics.

kamaulindhardt commented 3 years ago

Hi again Julia,

Is there a way to plot the LASSO predictors vs. the actual outcome values? Like we have for glmnet linear models?

Cheers,

juliasilge commented 3 years ago

@kamaulindhardt If there are particular plots from glmnet you want to create, you can use pull_workflow_fit() to get the parsnip fit, then access the $fit object and call those glmnet functions on it.
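A minimal sketch of that, with lasso_fit as an illustrative name for a fitted workflow:

glmnet_obj <- pull_workflow_fit(lasso_fit)$fit  # the underlying glmnet object
plot(glmnet_obj, xvar = "lambda")               # e.g. glmnet's coefficient path plot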

kamaulindhardt commented 3 years ago


Thanks for this reply (see above). Just to clarify: I am simply using one column “logRR” in the dataset as my outcome/dependent variable, while all other variables are predictors in my regression. So it's not that I expect RR or logRR to come out of my tree-based model 😉 A partial dependence plot could maybe be a good option for each model, but not to compare models, I guess?

MaChenChen1211 commented 3 years ago

RE: @mandarpriya, I also get an error when creating rf_res.

rf_res <- fit_resamples(
  weekly_attendance ~ .,
  rf_spec,
  nfl_folds,
  control = control_resamples(save_pred = TRUE)
)

Error: The first argument to [fit_resamples()] should be either a model or workflow.
juliasilge commented 3 years ago

This blog post is not about random forest or NFL data, but the post that is about those is fairly old, and there was a change to tune a while back so that you should now put either a workflow or a model first. Hence the error message:

The first argument to [fit_resamples()] should be either a model or workflow.

The fix is to put your model or workflow as the first argument, like this:

rf_res <- fit_resamples(
  rf_spec,
  weekly_attendance ~ ., 
  nfl_folds,
  control = control_resamples(save_pred = TRUE)
)
ayueme commented 3 years ago

Hi, Julia, thanks for your tutorial. After getting rf_res from fit_resamples(), how do I get the results on nfl_test?

juliasilge commented 3 years ago

@ayue2019 Are you maybe thinking of a different blog post? That one does show how to predict on the test set.

MaChenChen1211 commented 3 years ago

I figure I messed up and not you :-) Thanks

ayueme commented 3 years ago

Hi, Julia, your link is still this post...

TidyTuesday and tidymodels (https://juliasilge.com/blog/intro-tidymodels/)

juliasilge commented 3 years ago

Oh, weird! I was looking at the comments on the GitHub backend and they are labeled wrong somehow. My apologies!

If you have a fitted model called rf_fit, you can get predictions on the test set via predict(rf_fit, new_data = nfl_test). A resampled object like rf_res does not have a fitted model to use for prediction. You might want to look into using last_fit() and collect_metrics() as shown in some other posts, like this one.
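As a minimal sketch of the last_fit() approach, assuming you kept the initial split object (here called nfl_split) and have a workflow rf_wf (both names are illustrative):

rf_last <- last_fit(rf_wf, nfl_split)  # fits on the training set, evaluates on the test set
collect_metrics(rf_last)               # metrics computed on the test set
collect_predictions(rf_last)           # test-set predictions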

ayueme commented 3 years ago

Yeah! Thank you very much.

lilugarold commented 3 years ago

Hi Julia, Thanks for the tutorial. I am trying to replicate your example on my own data and am getting an error.

Here is the code snippet where I am facing an issue (train_tbl is the training data ; perc is the dependent variable with values between 0 and 1):

results_train <- lm_fit %>%
  predict(new_data = train_tbl) %>%
  mutate(truth = ts(train_tbl$perc), model = "lm") %>%
  bind_rows(
    rf_fit %>%
      predict(new_data = train_tbl) %>%
      mutate(truth = ts(train_tbl$perc), model = "rf")
  )

I am getting the following error and was not able to identify the solution based on the stackoverflow search:

Error in UseMethod("mutate") : no applicable method for 'mutate' applied to an object of class "c('double', 'numeric')"

Any chance you can help with this?

juliasilge commented 2 years ago

@lilugarold It's hard to say without your data but I am guessing it might be related to using the ts() function there. Can you create a reprex demonstrating your problem and post this on RStudio Community? That's a good way to share a coding problem like this and get help.

duccioa commented 2 years ago


@lilugarold I think you might have a problem in predict(new_data = train_tbl) %>% mutate(...). The usual output of predict is a numeric vector, but mutate is a function that creates a new column in a data.frame, so it takes a data.frame and not a numeric vector. Here you are telling mutate to create new columns 'truth' and 'model' on a numeric vector.

ursulahneumann commented 2 years ago

Hi Julia, thank you for this tutorial. I'm wondering if there is a way to view the dataframe after all the steps have been performed in the recipe (I want to check that the functions did what I was expecting them to and that I didn't make a mistake).

juliasilge commented 2 years ago

@ursulahneumann Yes, what you want to do is prep() and then bake() the recipe; those lower-level functions are worth getting to know for exactly this kind of check.
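A minimal sketch, with my_rec, my_train, and my_test as placeholder names for your recipe and data:

prepped <- prep(my_rec, training = my_train)
bake(prepped, new_data = NULL)     # your training data after every step has been applied
bake(prepped, new_data = my_test)  # the same trained steps applied to new data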

hail2thief commented 2 years ago

Thanks Julia for this excellent resource! I was wondering about the pre-processing step where you standardize all the numeric predictors before fitting the lasso. It was my understanding that glmnet standardizes the predictors by default. Is there a risk here that the predictors are being standardized twice (once in the pre-processing step, and then once again when they go through the lasso engine)?

juliasilge commented 2 years ago

@hail2thief You can read more about using glmnet with tidymodels here, but the short answer is that we don't change that default (standardize = TRUE) but that it does not do any harm to standardize already-standardized variables.
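If you ever did want to change that engine-level default (not something this post does), a minimal sketch would be to pass it through set_engine():

lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet", standardize = FALSE)  # engine-specific argument passed straight to glmnet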

hail2thief commented 2 years ago

Thanks for the quick response! So then the coefficient estimates in this example are still on the standardized scale, right? (glmnet spits out coefficients on their "original" scale, but the original scale here is standardized.)

juliasilge commented 2 years ago

@hail2thief Yep, that's correct. 👍

srgillott commented 2 years ago

Howdy!

First of all. Thank you for your work as you are saving my PhD!

But I have a question about the variable importance plot. I have looked around at the package info for generating VIP plots on linear models. It says it normally uses the t-statistic (https://koalaverse.github.io/vip/articles/vip.html); however, I see there is the lambda = lowest_rmse$penalty call in the vi function. So what exactly does importance mean here? Does it need the lambda call to recompute the t-statistic?

juliasilge commented 2 years ago

@srgillott In this blog post that uses LASSO regression via glmnet, calling vip() uses the model-based variable importance, i.e. vi_model(). If you read that documentation you'll find:

Similar to (generalized) linear models, the absolute value of the coefficients are returned for a specific model. It is important that the features (and hence, the estimated coefficients) be standardized prior to fitting the model. You can specify which coefficients to return by passing the specific value of the penalty parameter via the ... argument. See coef.glmnet for details. By default, the coefficients corresponding to the final penalty value in the sequence is returned; in other words, you should ALWAYS SPECIFY THIS VALUE! For "cv.glmnet" objects, the largest value of lambda such that error is within one standard error of the minimum is used by default. For "multnet" objects, the coefficients corresponding to the first class are used; that is, the first component of coef.glmnet.
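In code, that is why the penalty gets passed explicitly when computing importance; a sketch, with lasso_wf, office_train, and lowest_rmse as illustrative names:

library(vip)

lasso_wf %>%
  fit(office_train) %>%
  pull_workflow_fit() %>%
  vi(lambda = lowest_rmse$penalty) %>%  # absolute value of each coefficient at that penalty
  arrange(desc(Importance))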

GG-witt commented 2 years ago

I have a problem when trying to estimate a ridge regression using tidymodels.

ridge_recipe = recipe(data=ames_train, log(Sale_Price)~.)
ridge_model = linear_reg(penalty=0.1,mixture=0)|>set_engine("glmnet")
ridge_wf = work_flow() |> add_recipe(ridge_recipe) |> add_model(ridge_model)
ridge_fit = ridge_wf |> fit(data=ames_train)

When I try to fit, it does not work:

> ridge_fit = ridge_wf |> fit(data=ames_train)
> Error in app$vspace(new_style$`margin-top`%||% 0`) : 
attemp to apply non-function

I don't know what's wrong. Can anyone please help? Thanks.

juliasilge commented 2 years ago

It looks like ridge regression works fine:

library(tidymodels)
data(ames)

ames <- ames |> mutate(Sale_Price = log10(Sale_Price))

set.seed(502)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)

ridge_model <- linear_reg(penalty = 0.1, mixture = 0) |> set_engine("glmnet")
ridge_wf <- 
  workflow() |> 
  add_model(ridge_model) |>
  add_variables(outcome = Sale_Price, predictors = c(Longitude, Latitude))

ridge_fit <- ridge_wf |> fit(data = ames_train)
extract_fit_engine(ridge_fit) |> print()
#> 
#> Call:  glmnet::glmnet(x = maybe_matrix(x), y = y, family = "gaussian",      alpha = ~0) 
#> 
#>     Df  %Dev Lambda
#> 1    2  0.00 52.200
#> 2    2  0.12 47.560
#> 3    2  0.13 43.340
#> 4    2  0.15 39.490
#> 5    2  0.16 35.980
#> 6    2  0.17 32.780
#> 7    2  0.19 29.870
#> 8    2  0.21 27.220
#> 9    2  0.23 24.800
#> 10   2  0.25 22.590
#> 11   2  0.28 20.590
#> 12   2  0.30 18.760
#> 13   2  0.33 17.090
#> 14   2  0.36 15.570
#> 15   2  0.40 14.190
#> 16   2  0.44 12.930
#> 17   2  0.48 11.780
#> 18   2  0.52 10.730
#> 19   2  0.57  9.781
#> 20   2  0.63  8.912
#> 21   2  0.69  8.120
#> 22   2  0.75  7.399
#> 23   2  0.82  6.742
#> 24   2  0.90  6.143
#> 25   2  0.99  5.597
#> 26   2  1.08  5.100
#> 27   2  1.18  4.647
#> 28   2  1.28  4.234
#> 29   2  1.40  3.858
#> 30   2  1.53  3.515
#> 31   2  1.67  3.203
#> 32   2  1.82  2.918
#> 33   2  1.98  2.659
#> 34   2  2.15  2.423
#> 35   2  2.34  2.208
#> 36   2  2.54  2.011
#> 37   2  2.76  1.833
#> 38   2  2.99  1.670
#> 39   2  3.23  1.522
#> 40   2  3.50  1.386
#> 41   2  3.78  1.263
#> 42   2  4.08  1.151
#> 43   2  4.40  1.049
#> 44   2  4.73  0.956
#> 45   2  5.08  0.871
#> 46   2  5.45  0.793
#> 47   2  5.83  0.723
#> 48   2  6.23  0.659
#> 49   2  6.65  0.600
#> 50   2  7.07  0.547
#> 51   2  7.51  0.498
#> 52   2  7.96  0.454
#> 53   2  8.42  0.414
#> 54   2  8.88  0.377
#> 55   2  9.34  0.343
#> 56   2  9.81  0.313
#> 57   2 10.27  0.285
#> 58   2 10.72  0.260
#> 59   2 11.17  0.237
#> 60   2 11.61  0.216
#> 61   2 12.03  0.196
#> 62   2 12.44  0.179
#> 63   2 12.83  0.163
#> 64   2 13.20  0.149
#> 65   2 13.55  0.136
#> 66   2 13.88  0.123
#> 67   2 14.19  0.112
#> 68   2 14.48  0.102
#> 69   2 14.74  0.093
#> 70   2 14.98  0.085
#> 71   2 15.20  0.078
#> 72   2 15.40  0.071
#> 73   2 15.58  0.064
#> 74   2 15.74  0.059
#> 75   2 15.89  0.053
#> 76   2 16.01  0.049
#> 77   2 16.13  0.044
#> 78   2 16.22  0.040
#> 79   2 16.31  0.037
#> 80   2 16.39  0.034
#> 81   2 16.45  0.031
#> 82   2 16.51  0.028
#> 83   2 16.56  0.025
#> 84   2 16.60  0.023
#> 85   2 16.64  0.021
#> 86   2 16.67  0.019
#> 87   2 16.69  0.017
#> 88   2 16.72  0.016
#> 89   2 16.74  0.015
#> 90   2 16.75  0.013
#> 91   2 16.77  0.012
#> 92   2 16.78  0.011
#> 93   2 16.79  0.010
#> 94   2 16.80  0.009
#> 95   2 16.80  0.008
#> 96   2 16.81  0.008
#> 97   2 16.81  0.007
#> 98   2 16.82  0.006
#> 99   2 16.82  0.006
#> 100  2 16.82  0.005

Created on 2022-06-02 by the reprex package (v2.0.1)

Can you create a reprex (a minimal reproducible example) for your problem? The goal of a reprex is to make it easier for people to recreate your problem so that they can understand it and/or fix it. I recommend that you post your reprex on RStudio Community because it is a great forum for R programming problems.

Pascal-Schmidt commented 1 year ago

Hi Julia,

When you are tuning the model, does it fit all 25 splits for each hyperparameter? So does it fit 25*50=1250 models and then return the mean rmse and rsq for each parameter in the grid, averaged across the 25 fits?

Thank you!

juliasilge commented 1 year ago

@Pascal-Schmidt Yep, that's right. If you use collect_metrics() like I do in this blog post, you will see the mean RMSE and R squared. If you want to get back the metrics for each fold, you can do collect_metrics(summarize = FALSE); that will return a dataframe with 2500 rows, two rows (rmse and rsq) for each model that was fit.
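For example, assuming the lasso_grid tuning results from this post:

collect_metrics(lasso_grid)                    # mean rmse and rsq per penalty, across resamples
collect_metrics(lasso_grid, summarize = FALSE) # one row per resample, metric, and penalty value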

MCV97 commented 1 year ago

Hi Julia! Thank you for your amazing videos. I don't know what I would do without them. Is there any way to get p-values when extracting the workflow fit? I am only getting the estimates.

PS: I am using ridge logistic regression.

This is what I write in R:

HD_wf %>%
  finalize_workflow(ridge_roc) %>%
  fit(HD_train) %>%
  extract_fit_parsnip() %>%
  tidy()

juliasilge commented 1 year ago

@MCV97 As far as I am aware, glmnet does not provide p-values or do significance testing. You can read more about this topic, but it's not easy to say what a p-value means or should mean for LASSO or ridge regression, etc.

dzaltz commented 1 year ago

Thank you so much for this clear tutorial, Dr. Silge! I am wondering if you could direct me to some resources on replicating this with a multilevel model. Is it possible to include random effects with glmnet, or is there another way to go about lasso regression within a multilevel framework? Thank you!

juliasilge commented 1 year ago

@dzaltz I don't have practical experience myself in combining regularization with random effects, but I have seen folks use the glmmLasso package for this. There isn't support for this yet in tidymodels but there is an issue open here; you should add your perspective and use case there so it can be prioritized.

dzaltz commented 1 year ago

@juliasilge Thanks for the quick reply, I really appreciate it! I need to think through this more and may then comment on the tidymodels issue. For example, ignoring nesting and doing a simple linear regression on clustered data would still yield similar estimates but obviously different SEs, and I don't think the SEs of regularized models are very informative regardless... more reading on this... thanks again.

Pancreatologist commented 1 year ago

Hi Julia, many thanks for your guidance. When we use last_fit() and pass it the data split (including training and test), will it only validate on the test set? How can we make a ROC curve that compares training and test for the final model? I'm a new learner of tidymodels, sorry for any inconvenience. Thanks again.

juliasilge commented 1 year ago

@Pancreatologist Be sure to be very careful computing metrics (like ROC AUC) using the training data; you can read about the dangers of that here. We don't recommend comparing, say, ROC curves for training and testing sets as part of model evaluation. However, if it's really something you would want to do (like for teaching or demonstration), you would take an approach like this one.
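For the test-set side of that, a minimal sketch with final_res as a last_fit() result, outcome as your factor truth column, and "Yes" as the event level (all illustrative names):

final_res %>%
  collect_predictions() %>%
  roc_curve(truth = outcome, .pred_Yes) %>%
  autoplot()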

Pancreatologist commented 10 months ago

Hi Julia, many thanks for your reply. In addition, I also wonder: when I used the 'vip' package to see which variables are most important, there were too many variables in this model. Is there any possibility for me to filter the variables? Or should I keep the model tuned to the better or best penalty and not filter subjectively? Thanks again.