juliasilge / juliasilge.com

My blog, built with blogdown and Hugo
https://juliasilge.com/

Tune XGBoost with tidymodels and #TidyTuesday beach volleyball | Julia Silge #9

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Tune XGBoost with tidymodels and #TidyTuesday beach volleyball | Julia Silge

Learn how to tune hyperparameters for an XGBoost classification model to predict wins and losses.

https://juliasilge.com/blog/xgboost-tune-volleyball/

SimonMontfort commented 2 years ago

How can I get the sensitivity after last_fit?

juliasilge commented 2 years ago

@SimonMontfort You can pass a metric_set() to last_fit() with whatever metrics you need.
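
For example, something along these lines should work (a sketch; the final_xgb workflow and vb_split objects follow this blog post, and the particular metrics are just illustrative):

final_res <- last_fit(
  final_xgb,
  vb_split,
  metrics = metric_set(roc_auc, accuracy, sensitivity, specificity)
)

collect_metrics(final_res)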

nvelden commented 2 years ago

What does the finalize() function do?

xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), vb_train),
  learn_rate(),
  size = 30
)

xgb_grid

juliasilge commented 2 years ago

@nvelden You can read more about that here, but briefly, it determines what appropriate values are for, say, mtry based on your real data.
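
As a quick sketch using dials directly (vb_train is the training set from this post):

library(dials)

mtry()
# the upper bound is unknown: Range: [1, ?]

finalize(mtry(), vb_train)
# the upper bound is now filled in from the number of columns in vb_train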

stepbystepdatascience commented 2 years ago

Hello, thank you for the tutorial! Am I right in thinking that if you had a recipe step that created more features, e.g. a step_dummy(), then finalize(mtry(), vb_train) would need to be something like finalize(mtry(), juice(my_recipe %>% prep())) to capture the extra columns created by the recipe?

juliasilge commented 2 years ago

@stepbystepdatascience Yes, you can do something like that, or use the functions in tidymodels that support finding the right value; these may be easier to use or help you avoid making mistakes. You can also pass in something like grid = 30 to tune_grid(), which will then automatically handle finding good/correct ranges for parameters like mtry.
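
Roughly, the two options look like this (a sketch; my_recipe, xgb_wf, and vb_folds stand in for your own objects):

# Option 1: finalize mtry() against the preprocessed training data
baked_train <- my_recipe %>% prep() %>% bake(new_data = NULL)
finalize(mtry(), baked_train)

# Option 2: let tune_grid() work out data-dependent parameter ranges for you
xgb_res <- tune_grid(
  xgb_wf,
  resamples = vb_folds,
  grid = 30
)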

izzydi commented 2 years ago

Hi Julia.

Thank you very much for your videos and all the code! I really love them and I'm learning a lot from them!

Can I ask you a question, please? I am using this code:

xgb_spec <-
  boost_tree(
    trees = 300,
    mtry = 2
  ) %>%
  set_engine("xgboost",
             nrounds = 20,
             max_depth = 2,
             gamma = 1) %>%
  set_mode("classification")

# workflow
xgb_wf <- 
  workflow() %>%
  add_recipe(rcp) %>%
  add_model(xgb_spec)

and when I fit the model I get this warning:

Warning: The following arguments cannot be manually modified and were removed: max_depth, nrounds, gamma.

Is there any way to set these arguments manually?

Thank you!

juliasilge commented 2 years ago

In parsnip we harmonize model argument names; see the examples here from ranger. It turns out that those names like nrounds and max_depth from xgboost mean the same thing as trees and tree_depth in parsnip, so you can't put them in twice. How were you supposed to know that??? It turns out that we used to have tables in our documentation that showed this for models but we somehow removed them at some point; I've opened an issue for us to add those back in. In the meantime, if you run into something like this, unfortunately for now the best place to look is the source code in parsnip, like this for boost_tree().
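
For the spec above, that would look something like this (a sketch; in parsnip, max_depth maps to tree_depth and gamma maps to loss_reduction, and the values here are just the ones from your example):

xgb_spec <-
  boost_tree(
    trees = 300,
    mtry = 2,
    tree_depth = 2,       # xgboost's max_depth
    loss_reduction = 1    # xgboost's gamma
  ) %>%
  set_engine("xgboost") %>%
  set_mode("classification")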

Akash-Ansari commented 2 years ago

It's a nice tutorial. By the way, my ROC curve is totally upside down. Is anyone else having this issue?

juliasilge commented 2 years ago

@Akash-Ansari yes, check out this comment above.

Akash-Ansari commented 2 years ago

Thank you so much. It works now.

conlelevn commented 2 years ago

Hi Julia,

What is the difference between sample_size() and min_n(), actually?

juliasilge commented 2 years ago

@conlelevn You can read a brief description here, which we hope lays out their definitions and differences.

SamiFarashi commented 2 years ago

Hi Julia, thanks for this great video. I am trying to tune a model with ~40 potential predictors and I am not able to produce xgb_res. The errors I am getting are:

Warning messages:
1: In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed, :
  scheduled cores 1, 2, 3 did not deliver results, all values of the jobs will be affected
2: All models failed. See the .notes column.

There is nothing informative in the .notes column!

Thanks, Sami

juliasilge commented 2 years ago

Can you create a reprex (a minimal reproducible example) for this? The goal of a reprex is to make it easier for us to recreate your problem so that we can understand it and/or fix it. If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. Once you have a reprex, I recommend posting on RStudio Community, which is a great forum for getting help with these kinds of modeling questions. One thing you will probably want to do is start off without parallel processing, to make the problem easier to diagnose. Thanks! 🙌

SamiFarashi commented 2 years ago

Hi Julia, thanks for the reply. I think I figured out what the problem was.

SamiFarashi commented 2 years ago

I had many 0s in the data. It's running now, but tune_grid is taking so long (~12 hours and still running); I am wondering if this is normal? Thanks again, Sami

juliasilge commented 2 years ago

@SamiFarashi I would say generally no, but it's hard to say without other information. If you are looking at a very long-running model, I recommend starting out with very few tuning parameters, few resamples or a subset of your data, and then scaling up to achieve the best model in a reasonable timeframe. If you can describe your situation in more detail, I recommend posting on RStudio Community, which is a great forum for getting help with these kinds of modeling questions.

dhillary-ias commented 1 year ago

Great post and package! Thanks so much!

MonkeyCousin commented 1 year ago

Hi Julia, excellent tutorial, thanks. I want to do multiclass classification; how is that possible, please?

juliasilge commented 1 year ago

Several of the models in tidymodels support multiclass classification! You can see some of them here, but also some models support this natively, like ranger.

MonkeyCousin commented 1 year ago

Thank you. Does that mean that xgboost as included in tidymodels does not support multiclass classification? I have seen examples where num_class is set along with other params, e.g. with objective = "multi:softprob". I am keen both to continue my foray into tidymodels and, for consistency across my project, to use xgboost.

juliasilge commented 1 year ago

@MonkeyCousin xgboost does support multiclass, yep. You can see an example here.
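
As a minimal sketch, nothing extra should be needed on the tidymodels side; my understanding is that when the outcome factor has more than two levels, parsnip takes care of num_class and the multiclass objective for you:

xgb_multi <-
  boost_tree(trees = 500) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# iris has a three-level outcome
xgb_multi_fit <- fit(xgb_multi, Species ~ ., data = iris)
predict(xgb_multi_fit, iris[1:5, ], type = "prob")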

wcwr commented 1 year ago

Hi Julia,

Thanks for this tutorial! When I run this with an XGBoost regression on my own data, everything works! However, the default model (setting trees=1000 and nothing more) performs slightly better than my tuned model!

Any idea if this is common? I'm wondering because I plan to implement this tuning step in many other areas of my code.

If relevant, I did choose the best parameters based on "rsq" rather than "RMSE" (which seem to be the choices for a regression-based xgb, compared to "auc" in the classification version).

juliasilge commented 1 year ago

@wcwr Take a look at this chapter to understand what might be happening by optimizing $R^2$ instead of RMSE. In general, I would be surprised if an untuned model with default parameters performed better than a model with tuned hyperparameters and I would double check that you're comparing models in a consistent way.
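
One quick thing to check is which metric you used when picking the final parameters, e.g. (a sketch, with xgb_res standing in for your own regression tuning results):

show_best(xgb_res, metric = "rmse")
best_rmse <- select_best(xgb_res, metric = "rmse")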

wcwr commented 1 year ago

Hi Julia,

In the tune_grid step, the resamples parameter was set to vb_folds, and the output of this final tuned model is xgb_res. Does this mean that the final model uses the hyperparameters that produced the best metric (AUC/RMSE/r) over the average of the 10 folds? Or could it be the single best fold? Or median perhaps?

Looked for this info in the tune_grid section of tune.tidymodels.org but I don't think I found it.

Thanks for the wonderful tutorial!

juliasilge commented 1 year ago

@wcwr The xgb_res object does not contain any final model. It contains the model performance results that you get across all the model configurations that were tried, estimated using the 10 folds. The next step is to choose the model you want (I did it here with select_best()) and then to train the model using that specific model configuration chosen via tuning on the whole training set with finalize_workflow() and last_fit(). You may want to read this "Getting Started" article on tidymodels.org.
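
Putting that together, the last steps look roughly like this (object names follow the blog post):

best_auc <- select_best(xgb_res, metric = "roc_auc")

final_xgb <- finalize_workflow(xgb_wf, best_auc)

final_res <- last_fit(final_xgb, vb_split)
collect_metrics(final_res)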

jlecornu3 commented 1 year ago

Hi Julia,

Thanks for the blog post and all your videos!

How can you assess accuracy comparisons between train and test sets from the collect_metrics() call on the final fit?

juliasilge commented 1 year ago

@jlecornu3 We don't recommend measuring model performance using the training set as a whole for the reasons outlined in this section and there purposefully isn't fluent tooling in tidymodels to do so using a final tuned model. However, if you look at this blog post, the metrics you see with collect_metrics(xgb_res) are metrics computed using resamples of the training set; this is what we do recommend.
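
In code, the distinction is roughly this (a sketch; xgb_res holds the resampling results and final_res the last_fit() output):

collect_metrics(xgb_res)    # estimated via resamples of the training set
collect_metrics(final_res)  # computed once on the held-out test set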

jlecornu3 commented 1 year ago

So do you feel this collect_metrics(xgb_res) is reflective of the true model performance on a test set...? Or would you advise computing accuracy / rmse / some other metric on both the resamples and the test set not used in the resamples? If the latter... does tidymodels offer this?

juliasilge commented 1 year ago

@jlecornu3 Ah, maybe I misunderstood what you were asking about in this blog post. You might want to check out this chapter on "spending your data budget" and how to use the training set vs. test set, as well as how last_fit() works.

jlecornu3 commented 1 year ago

Thanks Julia -- super clear!

mohamedelhilaltek commented 1 year ago

Hi Julia, I know this is not the appropriate place to ask this question, but I am trying to use mlflow in RStudio and I keep running into this error without finding any solution:

Error in process_initialize(self, private, command, args, stdin, stdout, …:
! Native call to processx_exec failed
Caused by error in chain_call(c_processx_exec, command, c(command, args), pty, pty_options, …:
! Command 'C:/Users/TAKKOUK/AppData/Local/MICROS~1/WINDOW~1/Scripts/mlflow' not found
@win/processx.c:982 (processx_exec)

juliasilge commented 1 year ago

@mohamedelhilaltek I recommend that you create a reprex (a minimal reproducible example) showing what you want to do and any problems you run into with it, then posting on Posit Community. I know there aren't a ton of mlflow users but generally it's a great forum for getting help with these kinds of questions. Good luck! 🙌

Hamza-Gouaref commented 1 year ago

Hi Julia, I have a regression problem where the target variable is zero more than 50 percent of the time. How can I handle this with xgboost? Is there any step for it?

juliasilge commented 1 year ago

@Hamza-Gouaref Hmmmm, if you had counts with a lot of zeroes, I would suggest that you use zero-inflated Poisson, like in this post. Can you formulate it as a Poisson problem? That would be my main suggestion.
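
If you can frame it that way, a zero-inflated Poisson spec via the poissonreg package looks roughly like this (the data and formula here are hypothetical; the part after | models the excess zeroes):

library(poissonreg)

zip_spec <- poisson_reg() %>% set_engine("zeroinfl")

zip_fit <- zip_spec %>%
  fit(count ~ predictor_1 + predictor_2 | predictor_1, data = my_data)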

retzerjj commented 1 year ago

Hi Julia, thanks for the great info! Very useful! One question: I've been trying, unsuccessfully, to create a couple of partial dependence plots for your example (for both numeric and categorical predictors). I think it's because I'm very unfamiliar with the tidyverse approach to predictive modeling and how/where objects are located. Could you direct me to a source that might be helpful (or a short code example)? I've been trying to use the pdp and DALEXtra packages. Thanks very much, Joe

juliasilge commented 1 year ago

@retzerjj Check out this chapter of our book that shows how to make partial dependence plots with DALEXtra. If you are wanting to figure out how to pull out various components of a tidymodels workflow, check out these methods, which can help you extract out the workflow, the parsnip model, the underlying engine model, and so forth.
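
As a rough sketch (assuming the final_res object from last_fit() in this post, and that win and kills are the outcome and a numeric predictor in vb_train):

library(DALEXtra)

# pull the fitted workflow out of the last_fit() result
fitted_wf <- extract_workflow(final_res)

explainer <- explain_tidymodels(
  fitted_wf,
  data = dplyr::select(vb_train, -win),
  y = as.integer(vb_train$win == "win"),
  label = "xgboost"
)

# partial dependence profile for one numeric predictor
pdp_kills <- model_profile(explainer, variables = "kills", N = 500)
plot(pdp_kills)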

mjwera commented 11 months ago

Thank you for the great video and help. My question is about the vip package to see the variable importance. When I try to install the package I get the error message, "package 'vip' is not available for this version of R". I'm using 4.2.2. Has vip been replaced by another package? Thanks.

juliasilge commented 11 months ago

@mjwera Ooooof, looks like it was archived from CRAN. You can read about their plans here and in the meantime you can install from GitHub.
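
For example, something like this (using the remotes package; the repo is the one linked above):

# install.packages("remotes")
remotes::install_github("koalaverse/vip")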

bgreenwell commented 11 months ago

@mjwera Apologies, it looks like vip was orphaned because of some failed tests from the last changes we made, but we never got the warning! It should be back up and running soon!

mjwera commented 11 months ago

Thank you!
