utterances-bot opened 3 years ago
Hi, Julia! Thank you so much for your wonderful tidymodels series. It is very informative and impressive. Nice job! For this XGBoost tuning blog, I found a weird result for the ROC curve part. Everything except the ROC curve works well. I got the same accuracy and AUC as yours, but my ROC curve is flipped along the diagonal. It is really weird. Since my curve is below the diagonal, the AUC should be less than 1/2 by definition. However, my AUC is the same as yours. Is it possible that something is wrong with the roc_curve() function? The version of yardstick I am using is 0.0.7. Thank you in advance.
Yes, since I published this blog post, there was a change in yardstick (in version 0.0.7) that changed how to choose which level (win or lose) is the "event". You can change this by using the event_level argument for functions like roc_curve().
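A minimal sketch of what that looks like in code (not from the original post; it assumes the blog's `final_res` tuning results and an outcome `win` with probability column `.pred_win`):

```r
# Sketch, assuming tidymodels is loaded and final_res holds saved predictions.
# event_level = "second" tells yardstick to treat the second factor level
# as the "event" when computing the ROC curve.
library(tidymodels)

final_res %>%
  collect_predictions() %>%
  roc_curve(win, .pred_win, event_level = "second") %>%
  autoplot()
```

With yardstick >= 0.0.7 the default is `event_level = "first"`, so flipping this argument is what un-mirrors the curve described above.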
Got it. Thank you.
Hi Julia,
Great tutorial. Thank you for your support.
I am facing two problems:
final_res %>%
collect_predictions() %>%
roc_curve(y,.pred_1,event_level="second") %>%
autoplot()
Error: The number of levels in truth (3) must match the number of columns supplied in ... (1).
Appreciate your time. Thanks in advance.
@Mr-Hadoop-Hotshot it sounds like something has gone a bit wrong somewhere in predictions, maybe some NA values are being generated? I would look at the output of collect_predictions() and see what is happening there.
The output of last_fit() contains a workflow that you can use for prediction on new data. I show how to do that in this post and this post.
Hi Julia,
Thank you for your reply. Your other tutorials are also excellent, as always. I found a solution for the second problem, but the first one remains the same.
a. Actually my original .df target variable had three levels (i.e., 1, 2 & 3). I applied the filter() command to use only 1 & 3.
b. Before doing initial_split(), I used the droplevels() command and then applied the last_fit() command.
c. Strangely, when I applied conf_mat(), no errors popped up, but the "2" level was also present, with both "Actual" & "Predicted" values shown as 0.
d. I suspect this is what is stopping me from generating the ROC curve. But when I check the levels of the variable, and on visual inspection, it's nowhere to be found.
e. collect_predictions() also returned a column for .pred_2. Very confused!!!
Any suggestions on this? Note, all NA values have also been addressed.
Appreciate your time. Thanks in advance.
@Mr-Hadoop-Hotshot Ah gotcha, I would go back to the very beginning and make sure that your initial data set only has two levels in your outcome; this sounds like however you are trying to filter and remove a level is not working. If you would like somewhere to ask for help, I recommend RStudio Community; be sure to create a reprex showing the problem.
Hi @juliasilge
Yeah sure, I tried that. Just wanted to let you know, your blog is full of quality information. Do you have any materials related to sentiment analysis in R?
Thank you once again.
Check out this chapter of my book on text mining for info on sentiment analysis.
Hey, this book was recommended by UT Austin when I was doing my PG program in data science and business analytics.
Great book! I used it as my reference source for my research. However, what are your thoughts on using the sentimentr package directly in a customer-feedback kind of scenario, rather than using the general NLP procedures and comparing to the sentiment lexicons as mentioned in the book? I know it requires a lot of your effort and time to make a video, but it would be great to learn NLP techniques from your videos. Thanks.
Hello Julia, thank you very much for sharing your work, it is very good. I am a follower of yours; I really like the pauses you take when explaining each detail of the code. Excellent, you are very charming.
Thanks for the tutorial! I wonder why we create vb_test if we never use it. Am I missing something?
@graco-roza I think I discuss this in the video, but the main idea there is to demonstrate how to prep() and bake() a recipe for debugging and problem-solving purposes. If you use a workflow() you don't technically need those steps, but it can be helpful to know what is going on under the hood and be able to troubleshoot if/when things go wrong.
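As a sketch of that debugging workflow (the original post used add_formula() rather than a recipe, so `vb_recipe` here is a hypothetical recipe built from the same vb_train data):

```r
# Sketch, assuming tidymodels is loaded and vb_recipe is a recipe
# defined on vb_train. prep() estimates any steps from the training
# data; bake() applies the prepped steps to a data set so you can
# inspect exactly what the model will see.
library(tidymodels)

vb_prepped <- prep(vb_recipe)

bake(vb_prepped, new_data = NULL)     # the preprocessed training set
bake(vb_prepped, new_data = vb_test)  # same steps applied to the test set
```

None of this is needed when the recipe lives inside a workflow(), but it is a quick way to spot problems like unexpected columns or stray factor levels.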
Hi @juliasilge, hope you are doing well and staying safe!
I recently started to encounter a problem when executing the predict(final$.workflow[[1]], my_dataframe[,]) code line after upgrading R from version 4.0.4 to 4.1.0.
ERROR MESSAGE : R Session Aborted. R encountered a fatal error.
Tried running that code line in console window directly and R throws the same error back.
Any suggestions on this issue?
Appreciate your time. Thanks in advance.
@Mr-Hadoop-Hotshot Hmmm, most things are working well on R 4.1.0 but we have run into a few small issues so far that we've needed to fix. I can't tell from just this what the problem might be. Can you create a reprex and post it with the details of your problem on RStudio Community? I think that will be the best way to find the solution.
Hey Julia, thank you very much for the amazing work! I am a new big data student, and I want to use this code in my project; however, I have already split and balanced my data for other models. For the purposes of the project I want to continue with the same split.
Is there any way I can put my prepared data into those split functions? I also did my random forest model with your code, but now I don't know how I can use my validation data for both models. Can you please help me? :)
@canlikala Yes, you can use existing training/testing splits in tidymodels; you will need to create your own split object manually, as shown here and in the links in that issue. If you have, say, existing training, validation, and testing data sets, you can definitely use them across multiple types of models.
This case study shows how we treat a validation set as a single resample.
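A minimal sketch of building such a split object by hand (the names `my_train` and `my_test` are placeholders for your existing data frames; this uses the data-frame method of make_splits() available in recent versions of rsample):

```r
# Sketch: turn pre-existing training/testing data frames into an
# rsplit object that tidymodels functions like last_fit() can use.
library(rsample)

my_split <- make_splits(my_train, assessment = my_test)

# Sanity checks: these should return the original data frames.
training(my_split)
testing(my_split)
```

Once you have `my_split`, it can stand in anywhere the blog post uses `vb_split`.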
Hi Julia, do you know if in parsnip I can estimate an ensemble model with XGBoost for regression, but with a linear booster? Thanks in advance, have a nice day. MC
@martinocrippa We don't currently make use of the linear booster in parsnip but we are tracking interest in that feature here. If you would like to either add a 👍 or add any helpful context for your use case there, that would be great.
ok, thank you very much have a nice day
Dear Julia,
I get the following error when using the grid_latin_hypercube() function to tune my XGBoost grid for a regression exercise: "Error: The provided grid is missing the following parameter columns that have been marked for tuning by tune(): 'trees'." I looked everywhere for an answer, no luck. Any idea? I think it has something to do with the "trees" definition.
Sorry, I found the reason: I forgot to set my `trees = 1000`. Now it works. However, I get this error in my XGBoost tuning:
"Fold01, Repeat1: preprocessor 1/1, model 30/30: Error: The option counts = TRUE was used but parameter colsample_bynode was given as 0. Please use...
! Fold01, Repeat1: internal: A correlation computation is required, but estimate is constant and has 0 standard deviation, resulting in a divide by 0 ...
x Fold02, Repeat1: preprocessor 1/1, model 2/30: Error: The option counts = TRUE was used but parameter colsample_bynode was given as 0. Please use ..."
Anyone having experience with this?
Thanks for this great example. I have a question.
In this example you are using XGBoost in a classification model and you naturally evaluate model performance in the end with a ROC curve.
My question is: What kind of model performance would you use for the case where XGBoost is used in regression?
@kamaulindhardt You can check out metrics that are appropriate for regression, and see some example ways to evaluate regression models in this chapter.
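A small sketch of what those regression metrics look like with yardstick (the `predictions` tibble and its `truth` column are placeholders, not from the original post):

```r
# Sketch: evaluate a regression model with yardstick, assuming a
# tibble `predictions` with an observed column `truth` and a
# predicted column `.pred` (the default name from predict()).
library(yardstick)

reg_metrics <- metric_set(rmse, rsq, mae)
reg_metrics(predictions, truth = truth, estimate = .pred)
```

The same metric_set() can also be passed to tune_grid() or last_fit() via their `metrics` argument, replacing accuracy and ROC AUC in the classification example.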
Dear Julia and all, this tutorial was a great help to me, and the comments as well, for managing all the errors I got during the analysis.
I have one problem which I could not solve: I need to get the variable importance values. I need them as exact numbers and not only in the plot.
Can you please be so kind and guide me on this issue?
Kind regards, Tamara
@TotorosForest You can use the vip::vi() function for that.
Dear Julia! Thank you so much! I think I have managed to solve the problem based on your comment:
mm_final_xgb %>% fit(data = df_mm_train) %>% pull_workflow_fit() %>% vip::vi()
I hope I have not written "hubble bubble" code :)
My goal is to select some variables from the 10 variables that are examined (8 variables are ordinal, 2 variables are binary). What would you recommend as a cutoff coefficient if you wanted to select only a few of these 10?
Moreover, what is this importance value? Is it an information gain value, the Gini index, regression coefficients? What should I call it in the report?
Thank you.
@TotorosForest You can look at the vip::vi() documentation here to see how the importance scoring works for various models. I think a cutoff decision would be very domain and data specific. Good luck!
Dear all, I have one more question about this part of the tutorial:
"It's time to go back to the testing set! Let's use last_fit() to fit our model one last time on the training data and evaluate our model one last time on the testing set. Notice that this is the first time we have used the testing data during this whole modeling analysis.
final_res <- last_fit(final_xgb, vb_split)"
My question: as our aim is to test the results on the testing set, shouldn't the data file be "vb_test" instead of "vb_split"?
As I understand it, vb_split is the result of the initial 75%/25% partition of the data, and if we want to test on the test set, shouldn't we choose "vb_test"?
Thank you for understanding my confusion.
Kind regards, Tamara
@TotorosForest You can check out the documentation for last_fit(); notice that it takes the split as the argument so that it can train one final time on the training data and evaluate on the testing data. You don't want to fit to the testing data.
Thank you. From your answer I also learned how to read library documentation and understand the arguments. As I am very unsure in these analyses, your comments are helping me greatly.
One more question. As you know, I have used your tutorial for my analysis. This analysis is one part of a manuscript I am currently working on. Therefore, of course, I would like to refer to your tutorial as a source of information. I cannot use the webpage in my reference list. Thus, I wonder, are there other sources of the very same information, e.g. a pdf file, a report, or something like this, that I could use?
I think you have done a great job with this tutorial and you should be referenced or acknowledged in the manuscript.
Regards Tamara
@TotorosForest Eventually the best reference for this kind of thing will be Tidy Modeling with R; that book is currently being finished up and we are still working on publisher details, though. Maybe you could cite it as in progress?
Thanks! I will use the book.
Thank you for this helpful tutorial. I'm trying to use it to build an analysis on an unbalanced dataset which requires an upsampling (e.g., step_upsample(...)) in a recipe step. However when I use a recipe() call rather than add_formula() to your code, the tuning step fails. e.g., inserting the recipe call here:
xgb_wf <- workflow() %>%
add_recipe(recipe(win ~ ., data = vb_train)) %>%
# add_formula(win ~ .) %>%
add_model(xgb_spec)
── Workflow ─────────────────────────────
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ─────────────────────────
0 Recipe Steps

── Model ────────────────────────────────
Boosted Tree Model Specification (classification)

Main Arguments:
  mtry = tune()
  trees = 1000
  min_n = tune()
  tree_depth = tune()
  learn_rate = tune()
  loss_reduction = tune()
  sample_size = tune()

Computational engine: xgboost
But then at the tune_grid step I get an error:
xgb_res <- tune_grid(
xgb_wf,
resamples = vb_folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE)
)
Fold10: preprocessor 1/1, model 30/30: Error in xgboost::xgb.DMatrix(x, label = y, missing = NA): 'data' has class 'character' and length 193500.
Do you have any hints on what I can do to fix it? I just need to upsample the low frequency target class during training...
All the best and thanks again,
Rich
@datarichard When you switch from a formula to a recipe, you'll need to take a little more control and specify the data preprocessing steps that the R formula does automatically for you, such as creating dummy/indicator variables from nominal data like gender and circuit. xgboost models need all numeric input. You can read a bit more about these issues here and here.
Thanks Julia. It wasn't immediately obvious to me why formula would automagically add dummy variables, but now I know!
The code solution for anyone else with a similar problem is something like this:
vb_recipe <- recipe(win ~ ., data = vb_train) %>%
step_dummy(circuit, gender)
xgb_wf <- workflow() %>%
add_recipe(vb_recipe) %>%
# add_formula(win ~ .) %>%
add_model(xgb_spec)
It gave me similar, although not identical, results to those you report here, despite using the same random seed settings.
Dear all, how can we calculate the sensitivity and specificity of the models instead of AUC values?
Thank you for the answer. Regards,
@TotorosForest you can use metric_set() to choose the metrics you want to use, as shown here and here.
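A sketch of what that looks like with the objects from this post (passing a custom metric_set() through the `metrics` argument of last_fit()):

```r
# Sketch, assuming tidymodels is loaded and final_xgb / vb_split
# come from the blog post. sens and spec are computed from hard
# class predictions; roc_auc uses the predicted probabilities.
library(tidymodels)

final_res <- last_fit(
  final_xgb, vb_split,
  metrics = metric_set(accuracy, roc_auc, sens, spec)
)

collect_metrics(final_res)
```

The same `metrics` argument also works in tune_grid(), if you want sensitivity and specificity during tuning rather than only at the final fit.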
Hi Julia, thank you for this good job. Do you think that the example you handled and the model trained here fall under the target leakage problem? Because if we want to predict who will win a match, we don't yet have the match statistics.
Regards Akil.
@akilelkamel I absolutely see your point; if the goal here is purely predictive and our imaginary situation is predicting before the match, then we could not use any information from the match itself. If instead we are thinking of this model as descriptive or having another goal, then we might want to use information from the match. You might want to check out this section on types of models where we explore this type of taxonomy.
Is there a way to integrate early stopping? That might save a lot of time in the tuning process...
@nvelden Yes, check out this post.
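As a rough sketch of what early stopping looks like in parsnip (this is my summary of the approach, not code from that post; the specific values are illustrative):

```r
# Sketch: parsnip's boost_tree() exposes early stopping for the
# xgboost engine via stop_iter. The `validation` engine argument
# holds out a fraction of the training data to monitor performance.
library(parsnip)

xgb_early <- boost_tree(
  trees = 1000,
  stop_iter = 10                    # stop after 10 rounds without improvement
) %>%
  set_engine("xgboost", validation = 0.2) %>%
  set_mode("classification")
```

This can cut tuning time substantially, since boosting rounds past the plateau are never fit.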
Hi, Julia! Thank you so much for your wonderful job!! I am working on my own dataset, and everything goes fine until the last_fit() function. When I get to this part:
final_res <- last_fit(final_xgb, vb_split)
It gives me the error:
Error in summary.connection(connection) : invalid connection
I have been searching for why it could be going wrong, but I haven't found anything. Do you know what could be wrong?
@Lauravhc Wow, that's weird; I've never seen that error. It looks like it is related to parallel workers getting confused; I would try out some of the solutions in that SO question and the links there. You can run your whole script sequentially (without parallel processing), right? If so, something about how you have your parallel processing set up isn't quite right.
Hello from a fan!
Thanks so much for this. Are there tutorials for doing regression with xgboost? I am working with a regression problem and this approach does not seem to work; I get this: "A correlation computation is required, but estimate is constant and has 0 standard deviation, resulting in a divide by 0 error. NA will be returned." I have 4 numeric columns and one numeric column to predict.
@adithirgis You get this error when the model predicts a single value for all samples. Like Max says in that issue:
Two examples could be a regularized model that eliminates all predictors except the intercept and a CART tree that contains no splits.
So it is typically a sign that your model is not going so well! With only four numeric columns, you probably want to try a simpler algorithm than xgboost. Maybe start with tuning a decision tree, then see if a bagged tree helps?
Thank you so much! Just so that I understand it right: the number of independent variables (i.e., 4) is quite low for xgboost, and I should also try another model.
Also are there any similar tutorials for regression in xgboost?
Thanks & Regards Adithi
@adithirgis I don't think I have an extensive example of xgboost for regression here on my blog, but you can see a shorter example of how to fit and predict for regression with xgboost here.
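For reference, switching the model specification itself to regression is a small change; a minimal sketch (illustrative, not from the post, using mtcars as a stand-in data set):

```r
# Sketch: the same boost_tree() setup in regression mode.
# Evaluate with regression metrics (rmse, rsq, mae) instead of roc_auc.
library(parsnip)

xgb_reg_spec <- boost_tree(trees = 1000) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

xgb_reg_fit <- fit(xgb_reg_spec, mpg ~ ., data = mtcars)
predict(xgb_reg_fit, new_data = mtcars)
```

The tuning and resampling machinery (tune_grid(), last_fit(), etc.) works the same way in either mode; only the mode and the metrics change.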
Thanks again! I kind of thought that I could use set_mode("regression") and use your code. My bad. A little new to modelling :)
I tried last_fit(final_xgb, vb_split, metric = "sens") but collect_metrics(final_res) still only shows me `accuracy` and `roc_auc`.
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.840 Preprocessor1_Model1
2 roc_auc binary 0.928 Preprocessor1_Model1
Why might that be?
Tune XGBoost with tidymodels and #TidyTuesday beach volleyball | Julia Silge
Learn how to tune hyperparameters for an XGBoost classification model to predict wins and losses.
https://juliasilge.com/blog/xgboost-tune-volleyball/