juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

Tune XGBoost with tidymodels and #TidyTuesday beach volleyball | Julia Silge #9

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Tune XGBoost with tidymodels and #TidyTuesday beach volleyball | Julia Silge

Learn how to tune hyperparameters for an XGBoost classification model to predict wins and losses.

https://juliasilge.com/blog/xgboost-tune-volleyball/

Haoran-Jiang commented 3 years ago

Hi, Julia! Thank you so much for your wonderful tidymodels series. It is very informative and impressive. Nice job! For this XGBoost tuning blog, I found a weird result in the ROC curve part. Everything except the ROC curve works well: I got the same accuracy and AUC as yours, but my ROC curve is flipped along the diagonal. It is really weird. Since my curve is below the diagonal, the AUC should be less than 1/2 by definition; however, my AUC is the same as yours. Is it possible that something is wrong with the roc_curve() function? The version of yardstick I am using is 0.0.7. Thank you in advance.

juliasilge commented 3 years ago

Yes, since I published this blog post, there was a change in yardstick (in version 0.0.7) to how the "event" level (win or lose) is chosen. You can change this by using the event_level argument for functions like roc_curve().
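For readers hitting the same flipped curve, a minimal sketch (assuming the `final_res` object and the `.pred_win` prediction column names from the post; adjust to your own column names):

```r
library(tidymodels)

# Flip which factor level counts as the "event" when computing the curve;
# in yardstick >= 0.0.7 the first factor level is the event by default.
final_res %>%
  collect_predictions() %>%
  roc_curve(win, .pred_win, event_level = "second") %>%
  autoplot()
```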

Haoran-Jiang commented 3 years ago

Got it. Thank you.

Mr-Hadoop-Hotshot commented 3 years ago

Hi Julia,

Great tutorial. Thank you for your support.

I am facing two problems:

  1. My code:

    final_res %>% 
      collect_predictions() %>%
      roc_curve(y, .pred_1, event_level = "second") %>% 
      autoplot()

    gives the error:

    Error: The number of levels in truth (3) must match the number of columns supplied in ... (1).

  2. How do I deploy the model for real-time data? As in, how can I run this model against another dataframe?

Appreciate your time. Thanks in advance.

juliasilge commented 3 years ago

@Mr-Hadoop-Hotshot it sounds like something has gone a bit wrong somewhere in predictions, maybe some NA values are being generated? I would look at the output of collect_predictions() and see what is happening there.

The output of last_fit() contains a workflow that you can use for prediction on new data. I show how to do that in this post and this post.
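A minimal sketch of that prediction step (assuming `final_res` is the output of last_fit(), and `new_matches` is a hypothetical data frame with the same predictor columns as the training data):

```r
library(tidymodels)

# Recent versions of tune let you extract the trained workflow directly;
# in older versions it lives in final_res$.workflow[[1]].
fitted_wf <- extract_workflow(final_res)

predict(fitted_wf, new_data = new_matches)                  # class predictions
predict(fitted_wf, new_data = new_matches, type = "prob")   # class probabilities
```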

Mr-Hadoop-Hotshot commented 3 years ago

Hi Julia,

Thank you for your reply. Your other tutorials are also excellent, as always. I found a solution to the second problem.

But the first problem remains the same.

a. Actually, my original data frame's target variable had three levels (i.e., 1, 2, and 3). I applied the filter() command to use only 1 and 3.
b. Before doing initial_split() I used the droplevels() command and then applied the last_fit() command.
c. Strangely, when I applied conf_mat(), no errors popped up, but the "2" level was still present, with both "Actual" and "Predicted" counts of 0.
d. I suspect this is what is stopping me from generating the ROC curve. But when I check the levels of the variable and inspect visually, it's nowhere to be found.
e. collect_predictions() also returned a column for .pred_2. Very confused!!!

Any suggestions on this? Note, all NA values have also been addressed.

Appreciate your time. Thanks in advance.

juliasilge commented 3 years ago

@Mr-Hadoop-Hotshot Ah gotcha, I would go back to the very beginning and make sure that your initial data set only has two levels in your outcome; this sounds like however you are trying to filter and remove a level is not working. If you would like somewhere to ask for help, I recommend RStudio Community; be sure to create a reprex showing the problem.
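A minimal sketch of that filtering step (hypothetical data frame `df` with a three-level factor outcome `y`):

```r
library(dplyr)

df2 <- df %>%
  filter(y != "2") %>%          # drop the rows for the unwanted class
  mutate(y = droplevels(y))     # drop the now-empty factor level itself

levels(df2$y)                   # should list only the two remaining levels
```

If the level is only filtered out but not dropped from the factor, downstream functions like conf_mat() and collect_predictions() will still see three classes.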

Mr-Hadoop-Hotshot commented 3 years ago

Hi @juliasilge

Yeah, sure, I tried that. Just wanted to let you know that your blog is full of quality information. Do you have any materials related to sentiment analysis in R?

Thank you once again.

juliasilge commented 3 years ago

Check out this chapter of my book on text mining for info on sentiment analysis.

Mr-Hadoop-Hotshot commented 3 years ago

Hey, this book was recommended by UT Austin when I was doing my PG program in data science and business analytics. Great book! I used it as a reference source for my research. However, what are your thoughts on using the sentimentr package directly in a customer-feedback scenario, rather than using the general NLP procedures and comparing against the sentiment lexicons as mentioned in the book? I know it requires a lot of your effort and time to make a video, but it would be great to learn NLP techniques from your videos. Thanks.

jderazoa commented 3 years ago

Hello Julia, thank you very much for sharing your work, it is very good. I am a follower of yours and I really like the pauses you take when explaining each detail of the code. Excellent, you are very kind.

graco-roza commented 3 years ago

Thanks for the tutorial! I wonder why we create vb_test if we never use it. Am I missing something?

juliasilge commented 3 years ago

@graco-roza I think I discuss this in the video, but the main idea there is to demonstrate how to prep() and bake() a recipe for debugging and problem-solving purposes. If you use a workflow() you don't technically need those steps, but it can be helpful to know what is going on under the hood and to be able to troubleshoot if/when things go wrong.
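A minimal sketch of that debugging pattern (the recipe here is illustrative, since the post itself uses add_formula() rather than a recipe):

```r
library(tidymodels)

vb_rec <- recipe(win ~ ., data = vb_train) %>%
  step_dummy(all_nominal_predictors())

# prep() estimates the steps from the training data; bake() applies them,
# so you can inspect exactly the data the model would see.
vb_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%   # NULL returns the processed training set
  glimpse()
```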

Mr-Hadoop-Hotshot commented 3 years ago

Hi @juliasilge Hope you are doing well and safe!

Hey, I recently started to encounter a problem executing the line predict(final$.workflow[[1]], my_dataframe[,]) after upgrading my R installation from version 4.0.4 to 4.1.0.

ERROR MESSAGE : R Session Aborted. R encountered a fatal error.

Tried running that line directly in the console window and R throws the same error back.

Any suggestions on this issue?

Appreciate your time. Thanks in advance.

juliasilge commented 3 years ago

@Mr-Hadoop-Hotshot Hmmm, most things are working well on R 4.1.0 but we have run into a few small issues so far that we've needed to fix. I can't tell from just this what the problem might be. Can you create a reprex and post it with the details of your problem on RStudio Community? I think that will be the best way to find the solution.

canlikala commented 3 years ago

Hey Julia, thank you very much for the amazing work! I am a new big data student and I want to use this code in my project; however, I have already split and balanced my data for the other models I built. For the purposes of the project I want to continue with the same split.

Is there any way I can put my prepared data into those split functions? I also built my random forest model with your code, but now I don't know how I can use my validation data for both models. Can you please help me? :)

juliasilge commented 3 years ago

@canlikala Yes, you can use existing training/testing splits in tidymodels; you will need to create your own split object manually, as shown here and in the links in that issue. If you have, say, existing training, validation, and testing data sets, you can definitely use them across multiple types of models.

This case study shows how we treat a validation set as a single resample.
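One way to sketch that manual split (hypothetical data frames `my_train` and `my_test`; `make_splits()` is from rsample):

```r
library(tidymodels)

# Stack the existing sets, then mark which rows are analysis (training)
# and which are assessment (testing).
combined <- dplyr::bind_rows(my_train, my_test)

my_split <- rsample::make_splits(
  list(analysis   = seq_len(nrow(my_train)),
       assessment = nrow(my_train) + seq_len(nrow(my_test))),
  data = combined
)

training(my_split)   # recovers my_train
testing(my_split)    # recovers my_test
```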

martinocrippa commented 3 years ago

Hi Julia, do you know if in parsnip I can estimate an ensemble model with XGBoost for regression, but with a linear booster? Thanks in advance, have a nice day. MC

juliasilge commented 3 years ago

@martinocrippa We don't currently make use of the linear booster in parsnip but we are tracking interest in that feature here. If you would like to either add a πŸ‘ or add any helpful context for your use case there, that would be great.

martinocrippa commented 3 years ago

ok, thank you very much have a nice day


kamaulindhardt commented 3 years ago

Dear Julia,

I get the following error: "Error: The provided grid is missing the following parameter columns that have been marked for tuning by tune(): 'trees'." when using the grid_latin_hypercube() function to tune my XGBoost grid for a regression exercise. I looked everywhere for an answer, no luck. Any idea? I think it has something to do with the "trees" definition.

kamaulindhardt commented 3 years ago

Sorry, I found the reason: I forgot to set trees = 1000. Now it works. However, I get this error in my XGBoost tuning:

"Fold01, Repeat1: preprocessor 1/1, model 30/30: Error: The option counts = TRUE was used but parameter colsample_bynode was given as 0. Please use...

! Fold01, Repeat1: internal: A correlation computation is required, but estimate is constant and has 0 standard deviation, resulting in a divide by 0 ...

x Fold02, Repeat1: preprocessor 1/1, model 2/30: Error: The option counts = TRUE was used but parameter colsample_bynode was given as 0. Please use ..."

Anyone having experience with this?

kamaulindhardt commented 3 years ago

Thanks for this great example. I have a question.

In this example you are using XGBoost in a classification model and you naturally evaluate model performance in the end with a ROC curve.

My question is: What kind of model performance would you use for the case where XGBoost is used in regression?

juliasilge commented 3 years ago

@kamaulindhardt You can check out metrics that are appropriate for regression, and see some example ways to evaluate regression models in this chapter.

TotorosForest commented 3 years ago

Dear Julia and all, I got great help from this tutorial, and from the comments as well, for managing all the errors I have been getting during the analysis.

I have one problem which I could not solve, namely I need to get the variable importance values. I need them as exact numbers and not only in the plot.

Can you please be so kind and guide me on this issue?

Kind regards, Tamara

juliasilge commented 3 years ago

@TotorosForest You can use the vip::vi() function for that.

TotorosForest commented 3 years ago

Dear Julia! Thank you so much! I think I have managed to solve the problem based on your comment:

    mm_final_xgb %>%
      fit(data = df_mm_train) %>%
      pull_workflow_fit() %>%
      vip::vi()

I hope I have not written "hubble bubble" code :)

My goal is to select some variables from the 10 variables that are examined (8 variables are ordinal, 2 variables are binary). What would you recommend as a cutoff coefficient in case you wanted to select only a few of these 10?

Moreover, what is this importance value? Is it an information gain value, a Gini index, regression coefficients? How would I refer to it in the report?

Thank you.

juliasilge commented 3 years ago

@TotorosForest You can look here at the vip::vi() documentation to see how the importance scoring works for various models. I think a cutoff decision would be very domain and data specific. Good luck!

TotorosForest commented 3 years ago

Dear all, I have one more question about this part of the tutorial:

"It’s time to go back to the testing set! Let’s use last_fit() to fit our model one last time on the training data and evaluate our model one last time on the testing set. Notice that this is the first time we have used the testing data during this whole modeling analysis.

    final_res <- last_fit(final_xgb, vb_split)"

My question: as our aim is to test the results on the testing set, should the data argument not be "vb_test" instead of "vb_split"?

As I understand it, vb_split is the result of the initial 75%/25% partition of the data, and if we want to test on the test set, should we not choose "vb_test"?

Thank you for understanding my confusion.

Kind regards, Tamara

juliasilge commented 3 years ago

@TotorosForest You can check out the documentation for last_fit(); notice that it takes the split as the argument so that it can train one final time on the training data and evaluate on the testing data. You don't want to fit to the testing data.
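Roughly, last_fit() is doing something like this under the hood (assuming `final_xgb` and `vb_split` from the post), which is why it needs the split rather than vb_test alone:

```r
library(tidymodels)

# Fit one final time on the training portion of the split...
fitted_wf <- fit(final_xgb, data = training(vb_split))

# ...then evaluate only on the held-out testing portion.
preds <- predict(fitted_wf, testing(vb_split), type = "prob")
```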

TotorosForest commented 3 years ago

Thank you. From your answer I am also learning how to read the documentation and understand the arguments. As I am very unsure in these analyses, your comments are helping me greatly.

One more question. As you understand, I have used your tutorial for doing my analysis. This analysis is one part of the manuscript I am currently working on. Therefore, of course, I would like to refer to your tutorial as a source of information. I cannot use the webpage in my reference list. Thus, I wonder, is there some other source of the very same information, e.g. a PDF file, a report, or something like this, that I could use?

I think you have done a great job with this tutorial and you should be referenced or acknowledged in the manuscript.

Regards, Tamara

juliasilge commented 3 years ago

@TotorosForest Eventually the best reference for this kind of thing will be Tidy Modeling with R; that book is currently being finished up and we are still working on publisher details, though. Maybe you could cite it as a work in progress?

TotorosForest commented 3 years ago

Thanks! I will use the book.

datarichard commented 2 years ago

Thank you for this helpful tutorial. I'm trying to use it to build an analysis on an unbalanced dataset which requires an upsampling (e.g., step_upsample(...)) in a recipe step. However when I use a recipe() call rather than add_formula() to your code, the tuning step fails. e.g., inserting the recipe call here:

xgb_wf <- workflow() %>%
  add_recipe(recipe(win ~ ., data = vb_train)) %>%
  # add_formula(win ~ .) %>%
  add_model(xgb_spec)

══ Workflow ════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────
Boosted Tree Model Specification (classification)

Main Arguments:
  mtry = tune()
  trees = 1000
  min_n = tune()
  tree_depth = tune()
  learn_rate = tune()
  loss_reduction = tune()
  sample_size = tune()

Computational engine: xgboost

But then at the tune_grid step I get an error:

xgb_res <- tune_grid(
  xgb_wf,
  resamples = vb_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)

Fold10: preprocessor 1/1, model 30/30: Error in xgboost::xgb.DMatrix(x, label = y, missing = NA): 'data' has class 'character' and length 193500.

Do you have any hints on what I can do to fix it? I just need to upsample the low-frequency target class during training...

All the best and thanks again,

Rich

juliasilge commented 2 years ago

@datarichard When you switch from a formula to a recipe, you'll need to take a little more control and specify the data preprocessing steps that the R formula does automatically for you, such as creating dummy/indicator variables from nominal data, like gender and circuit. xgboost models need all numeric input. You can read a bit more about these issues here and here.

datarichard commented 2 years ago

Thanks Julia. It wasn't immediately obvious to me why a formula would automagically add dummy variables, but now I know!

The code solution for anyone else with a similar problem is something like this:

vb_recipe <- recipe(win ~ ., data = vb_train) %>%
  step_dummy(circuit, gender)

xgb_wf <- workflow() %>%
  add_recipe(vb_recipe) %>%
  # add_formula(win ~ .) %>%
  add_model(xgb_spec)

It gave me similar, although not identical, results to those you report here, despite using the same random seed settings.

TotorosForest commented 2 years ago

Dear all, how can we calculate the sensitivity and specificity of the models instead of AUC values?

Thank you for the answer. Regards,

juliasilge commented 2 years ago

@TotorosForest you can use metric_set() to choose the metrics you want to use, as shown here and here.
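A minimal sketch (assuming `final_xgb` and `vb_split` from the post): pass a metric_set() to last_fit() via its metrics argument.

```r
library(tidymodels)

vb_metrics <- metric_set(accuracy, roc_auc, sens, spec)

final_res <- last_fit(final_xgb, vb_split, metrics = vb_metrics)
collect_metrics(final_res)   # now includes sensitivity and specificity
```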

akilelkamel commented 2 years ago

Hi Julia, thank you for this good job. Do you think that the example you handled and the model trained here fall under the target leakage problem? Because if we want to predict who will win a match, we don't yet have the match statistics.

Regards Akil.

juliasilge commented 2 years ago

@akilelkamel I absolutely see your point; if the goal here is purely predictive and our imaginary situation is predicting before the match, then we could not use any information from the match itself. If instead we are thinking of this model as descriptive or having another goal, then we might want to use information from the match. You might want to check out this section on types of models where we explore this type of taxonomy.

nvelden commented 2 years ago

Is there a way to integrate early stopping? That might save a lot of time in the tuning process...

juliasilge commented 2 years ago

@nvelden Yes, check out this post.
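For reference, parsnip exposes xgboost's early stopping through the stop_iter argument; a minimal sketch (the validation engine argument holds out a fraction of the training data to monitor improvement):

```r
library(tidymodels)

xgb_spec <- boost_tree(
  trees = 1000,
  stop_iter = 10                # stop if no improvement for 10 rounds
) %>%
  set_engine("xgboost", validation = 0.2) %>%  # held-out fraction to monitor
  set_mode("classification")
```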

Lauravhc commented 2 years ago

Hi, Julia! Thank you so much for your wonderful work!! I am working on my own dataset, and everything goes fine until the last_fit() function. When I get to this part:

    final_res <- last_fit(final_xgb, vb_split)

It tells me the error:

Error in summary.connection(connection) : invalid connection

I have been searching for why it could be wrong, but I haven't found anything. Do you know what could be wrong?

juliasilge commented 2 years ago

@Lauravhc Wow, that's weird; I've never seen that error. It looks like it is related to parallel workers getting confused; I would try out some of the solutions in that SO question and the links there. You can run your whole script sequentially (without parallel processing), right? If so, something about how you have your parallel processing set up isn't quite right.
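One quick way to test that, assuming a foreach-based parallel backend such as doParallel was registered: force sequential execution and re-run.

```r
# Revert to sequential processing; if last_fit() now succeeds, the
# problem is in the parallel backend setup, not in the model code.
foreach::registerDoSEQ()

final_res <- last_fit(final_xgb, vb_split)
```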

adithirgis commented 2 years ago

Hello from a fan!

Thanks so much for this. Are there tutorials for doing regression with xgboost? I am working on a regression problem and this approach does not seem to work; I get this: "A correlation computation is required, but estimate is constant and has 0 standard deviation, resulting in a divide by 0 error. NA will be returned." I have 4 numeric predictor columns and one numeric column to predict.

juliasilge commented 2 years ago

@adithirgis You get this error when the model predicts a single value for all samples. Like Max says in that issue:

Two examples could be a regularized model that eliminates all predictors except the intercept and a CART tree that contains no splits.

So it is typically a sign that your model is not going so well! With only four numeric columns, you probably want to try a simpler algorithm than xgboost. Maybe start with tuning a decision tree, then see if a bagged tree helps?
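A minimal sketch of that simpler starting point (hypothetical data with numeric outcome `y`):

```r
library(tidymodels)

# A tunable single decision tree via the rpart engine.
tree_spec <- decision_tree(
  cost_complexity = tune(),
  tree_depth      = tune(),
  min_n           = tune()
) %>%
  set_engine("rpart") %>%
  set_mode("regression")

tree_wf <- workflow() %>%
  add_formula(y ~ .) %>%
  add_model(tree_spec)
```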

adithirgis commented 2 years ago

Thank you so much! Just so that I understand it right: the number of independent variables (i.e., 4) is very small for xgboost, and I should also try another model.

Also, are there any similar tutorials for regression with xgboost?

Thanks & Regards Adithi

juliasilge commented 2 years ago

@adithirgis I don't think I have an extensive example of xgboost for regression here on my blog, but you can see a shorter example of how to fit and predict for regression with xgboost here.
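A minimal sketch of xgboost for regression in tidymodels (hypothetical data frames `train_df` and `test_df` with numeric outcome `y`):

```r
library(tidymodels)

xgb_reg <- boost_tree(trees = 500) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

xgb_fit <- fit(xgb_reg, y ~ ., data = train_df)

# Default regression metrics from yardstick: RMSE, R-squared, MAE.
predict(xgb_fit, test_df) %>%
  bind_cols(test_df) %>%
  metrics(truth = y, estimate = .pred)
```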

adithirgis commented 2 years ago

Thanks again! I kind of thought that I could use set_mode("regression") and use your code. My bad. A little new to modelling :)

SimonMontfort commented 2 years ago

I tried last_fit(final_xgb, vb_split, metric = "sens") but collect_metrics(final_res) still only shows me accuracy and roc_auc.

 .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.840 Preprocessor1_Model1
2 roc_auc  binary         0.928 Preprocessor1_Model1

Why might that be?