utterances-bot opened this thread 3 years ago
How can I get the sensitivity after last_fit()?
@SimonMontfort You can pass a metric_set() to last_fit() with whatever metrics you need.
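For reference, a minimal sketch of what that looks like (the finalized workflow `final_xgb` and the split `vb_split` are assumed from the blog post):

```r
library(tidymodels)

# Pass a custom metric set to last_fit() so that sensitivity
# (and any other metrics you need) are computed on the test set.
final_res <- last_fit(
  final_xgb,
  vb_split,
  metrics = metric_set(roc_auc, accuracy, sens, spec)
)

collect_metrics(final_res)
```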
What does the finalize() function do?
xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), vb_train),
  learn_rate(),
  size = 30
)
xgb_grid
@nvelden You can read more about that here, but briefly, it determines what appropriate values are for a parameter like mtry based on your real data.
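As a short illustration (vb_train is the training data from the post; dials is part of tidymodels):

```r
library(dials)

# mtry() has an unknown upper bound until it sees the data;
# its range prints as [1, ?].
mtry()

# finalize() fills in that unknown upper bound using the number
# of predictors actually present in vb_train:
finalize(mtry(), vb_train)
```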
Hello, thank you for the tutorial! Am I right in thinking that if you had a recipe step that created more features e.g. a step_dummy() the finalize(mtry(), vb_train) would need to be something like finalize(mtry(), juice(my_recipe %>% prep())) to capture the extra columns created by the recipe?
@stepbystepdatascience Yes, you can do something like that, or use the functions in tidymodels that support finding the right value; these may be easier to use or help you avoid making mistakes. You can also pass in something like grid = 30 to tune_grid(), which will then automatically handle finding good/correct ranges for parameters like mtry.
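A minimal sketch of that second option (the workflow `xgb_wf` and resamples `vb_folds` are assumed from the post):

```r
# Instead of building a grid yourself, let tune_grid() construct one;
# it finalizes data-dependent parameters such as mtry automatically,
# using the preprocessed training data inside the workflow.
xgb_res <- tune_grid(
  xgb_wf,
  resamples = vb_folds,
  grid = 30
)
```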
Hi Julia.
Thank you very much for your videos and all the code! I really love them and I'm learning a lot from them!
Can I ask you a question, please? I am using this code:
xgb_spec <-
  boost_tree(
    trees = 300,
    mtry = 2
  ) %>%
  set_engine("xgboost",
             nrounds = 20,
             max_depth = 2,
             gamma = 1) %>%
  set_mode("classification")

# workflow
xgb_wf <-
  workflow() %>%
  add_recipe(rcp) %>%
  add_model(xgb_spec)
And when I fit the model I get this warning:
Warning: The following arguments cannot be manually modified and were removed: max_depth, nrounds, gamma.
Is there any way to set these arguments manually ?
Thank you!
In parsnip we harmonize model argument names; see the examples here from ranger. It turns out that those names like nrounds and max_depth from xgboost mean the same thing as trees and tree_depth in parsnip, so you can't put them in twice. How were you supposed to know that??? It turns out that we used to have tables in our documentation that showed this for models but we somehow removed them at some point; I've opened an issue for us to add those back in. In the meantime, if you run into something like this, unfortunately for now the best place to look is the source code in parsnip, like this for boost_tree().
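Concretely, the engine arguments can be moved to their parsnip equivalents (trees for nrounds, tree_depth for max_depth, loss_reduction for gamma); a sketch based on the spec above, keeping trees = 300 from the main arguments:

```r
xgb_spec <-
  boost_tree(
    trees = 300,         # xgboost's nrounds
    tree_depth = 2,      # xgboost's max_depth
    loss_reduction = 1,  # xgboost's gamma
    mtry = 2
  ) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
```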
It's a nice tutorial. Btw, my ROC curve is totally upside down. Anyone having this issue?
@Akash-Ansari yes, check out this comment above.
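For reference, an upside-down ROC curve usually means the metric is treating the wrong factor level as the "event"; yardstick assumes the first level by default, and you can override that with event_level. A hedged sketch (the outcome and prediction column names here are assumed, not from the post):

```r
# If your positive class is the *second* factor level, say so explicitly,
# otherwise the curve is computed for the wrong class and flips.
xgb_res %>%
  collect_predictions() %>%
  roc_curve(win, .pred_win, event_level = "second") %>%
  autoplot()
```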
Thank you so much. It works now.
Hi Julia,
What is the difference between sample_size() and min_n(), actually?
@conlelevn You can read a brief description here, which we hope lays out their definitions and differences.
Hi Julia,
Thanks for this great video. I am trying to tune a model for ~40 potential predictors and I am not able to produce xgb_res; the errors I am getting are:
Warning messages:
1: In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed, :
scheduled cores 1, 2, 3 did not deliver results, all values of the jobs will be affected
2: All models failed. See the .notes column.
Nothing informative in the .notes column!
Thanks, Sami
Can you create a reprex (a minimal reproducible example) for this? The goal of a reprex is to make it easier for us to recreate your problem so that we can understand it and/or fix it. If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. Once you have a reprex, I recommend posting on RStudio Community, which is a great forum for getting help with these kinds of modeling questions. One thing you will want to probably do is start off without using parallel processing to more easily diagnose the problem. Thanks! 🙌
Hi Julia, thanks for the reply. I think I figured out what the problem was (I had many 0s in the data). It's running now, but tune_grid() is taking so long, ~12 hours and still running; I am wondering if this is normal? Thanks again, Sami
@SamiFarashi I would say generally no, but it's hard to say without other information. If you are looking at a very long-running model, I recommend starting out with very few tuning parameters, few resamples or a subset of your data, and then scaling up to achieve the best model in a reasonable timeframe. If you can describe your situation in more detail, I recommend posting on RStudio Community, which is a great forum for getting help with these kinds of modeling questions.
Great post and package! Thanks so much!
Hi Julia, excellent tutorial, thanks. I want to do multiclass classification; how is that possible, please?
Several of the models in tidymodels support multiclass classification! You can see some of them here, but also some models support this natively, like ranger.
Thank you. Does that mean that xgboost as included in tidymodels does not support multi class classification? I have seen examples where num_class is set along with other params, e.g. with objective = "multi:softprob". I am keen to both continue my foray into tidymodels and, for consistency across my project, to use xgboost.
@MonkeyCousin xgboost does support multiclass, yep. You can see an example here.
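As a hedged sketch, a multiclass outcome generally needs no special engine arguments in tidymodels: a standard classification spec works, and parsnip sets the multiclass xgboost objective based on the number of factor levels in the outcome. The data and outcome names below are hypothetical:

```r
# A standard classification spec also covers multiclass outcomes;
# parsnip chooses the appropriate xgboost objective from the data.
xgb_multi <-
  boost_tree(trees = 500, tree_depth = 4) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# Fitting on data whose outcome factor has three or more levels:
fit(xgb_multi, species ~ ., data = species_df)
```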
Hi Julia,
Thanks for this tutorial! When I run this with an XGBoost regression on my own data, everything works! However, the default model (setting trees = 1000 and nothing more) performs slightly better than my tuned model!
Any idea if this is common? I'm wondering because I plan to implement this tuning step in many other areas of my code.
If relevant, I did choose the best parameters based on "rsq" rather than "RMSE" (which seem to be the choices for a regression-based xgb, compared to "auc" in the classification version).
@wcwr Take a look at this chapter to understand what might be happening by optimizing $R^2$ instead of RMSE. In general, I would be surprised if an untuned model with default parameters performed better than a model with tuned hyperparameters and I would double check that you're comparing models in a consistent way.
Hi Julia,
In the tune_grid() step, the resamples parameter was set to vb_folds, and the output of this final tuned model is xgb_res. Does this mean that the final model uses the hyperparameters that produced the best metric (AUC/RMSE/r) over the average of the 10 folds? Or could it be the single best fold? Or median perhaps?
I looked for this info in the tune_grid() section of tune.tidymodels.org but I don't think I found it.
Thanks for the wonderful tutorial!
@wcwr The xgb_res object does not contain any final model. It contains the model performance results that you get across all the model configurations that were tried, estimated using the 10 folds. The next step is to choose the model you want (I did it here with select_best()) and then to train the model using that specific model configuration, chosen via tuning, on the whole training set with finalize_workflow() and last_fit(). You may want to read this "Getting Started" article on tidymodels.org.
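Those steps look roughly like this (object names follow the blog post):

```r
# Pick the configuration with the best mean metric across the folds:
best_auc <- select_best(xgb_res, metric = "roc_auc")

# Plug those hyperparameter values into the tunable workflow:
final_xgb <- finalize_workflow(xgb_wf, best_auc)

# Fit on the full training set and evaluate once on the test set:
final_res <- last_fit(final_xgb, vb_split)
```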
Hi Julia,
Thanks for the blog post and all your videos!
How can you assess accuracy comparisons between train and test sets from the collect_metrics() call on the final fit?
@jlecornu3 We don't recommend measuring model performance using the training set as a whole, for the reasons outlined in this section, and there purposefully isn't fluent tooling in tidymodels to do so using a final tuned model. However, if you look at this blog post, the metrics you see with collect_metrics(xgb_res) are metrics computed using resamples of the training set; this is what we do recommend.
So do you feel this collect_metrics(xgb_res) is reflective of the true model performance on a test set...? Or would you advise computing accuracy / RMSE / some other metric on both the resamples and a test set not used in resampling? If the latter... does tidymodels offer this?
@jlecornu3 Ah, maybe I misunderstood what you were asking. In this blog post:
- collect_metrics(xgb_res) computes metrics from resamples of the training set
- collect_metrics(final_res) computes metrics from the test set
You might want to check out this chapter on "spending your data budget" and how to use the training set vs. test set, as well as how last_fit() works.
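As a short sketch of that distinction (object names as in the post):

```r
# Resampling estimates, computed on held-out folds of the training set:
collect_metrics(xgb_res)

# The final, single estimate from the held-out test set via last_fit():
collect_metrics(final_res)
```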
Thanks Julia -- super clear!
Hi Julia, I know this is not the appropriate place to ask this question, but I am trying to use mlflow in RStudio and I always face this error and have not found any solution:
Error in process_initialize(self, private, command, args, stdin, stdout, … :
! Native call to processx_exec failed
Caused by error in chain_call(c_processx_exec, command, c(command, args), pty, pty_options, … :
! Command 'C:/Users/TAKKOUK/AppData/Local/MICROS~1/WINDOW~1/Scripts/mlflow' not found @win/processx.c:982 (processx_exec)
@mohamedelhilaltek I recommend that you create a reprex (a minimal reproducible example) showing what you want to do and any problems you run into with it, then posting on Posit Community. I know there aren't a ton of mlflow users but generally it's a great forum for getting help with these kinds of questions. Good luck! 🙌
Hi Julia, I have a regression problem where the target variable is zero more than 50 percent of the time. How can I do this with xgboost? Is there any step for it?
@Hamza-Gouaref Hmmmm, if you had counts with a lot of zeroes, I would suggest that you use zero-inflated Poisson, like in this post. Can you formulate it as a Poisson problem? That would be my main suggestion.
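If the outcome can be framed as counts, a zero-inflated Poisson model is available in tidymodels through the poissonreg package; a hedged sketch, where the formula and data frame are placeholders:

```r
library(poissonreg)

# Zero-inflated Poisson via the "zeroinfl" engine (pscl package):
zip_spec <-
  poisson_reg() %>%
  set_engine("zeroinfl") %>%
  set_mode("regression")

zip_fit <- fit(zip_spec, outcome ~ ., data = my_counts_df)
```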
Hi Julia, thanks for the great info! Very useful! One question: I've been trying, unsuccessfully, to create a couple of partial dependence plots for your example (for both numeric and categorical predictors). I think it's because I'm very unfamiliar with the tidyverse approach to predictive modeling and how/where objects are located. Could you direct me to a source that might be helpful (or a short code example)? I've been trying to use the pdp and DALEXtra packages. Thanks very much, Joe
@retzerjj Check out this chapter of our book that shows how to make partial dependence plots with DALEXtra. If you are wanting to figure out how to pull out various components of a tidymodels workflow, check out these methods, which can help you extract out the workflow, the parsnip model, the underlying engine model, and so forth.
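A hedged sketch of a partial dependence plot with DALEXtra, assuming a fitted workflow `final_fitted`, training data vb_train, and the outcome win as in the post (the predictor name is also an assumption):

```r
library(DALEXtra)

# Wrap the fitted tidymodels workflow in an explainer:
explainer <- explain_tidymodels(
  final_fitted,
  data    = dplyr::select(vb_train, -win),
  y       = as.integer(vb_train$win),
  verbose = FALSE
)

# Partial dependence profile for one numeric predictor:
pdp <- model_profile(explainer, variables = "kills", N = 500)
plot(pdp)
```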
Thank you for the great video and help. My question is about the vip package to see the variable importance. When I try to install the package I get the error message, "package 'vip' is not available for this version of R". I'm using 4.2.2. Has vip been replaced by another package? Thanks.
@mjwera Ooooof, looks like it was archived from CRAN. You can read about their plans here and in the meantime you can install from GitHub.
@mjwera apologies, looks like vip was orphaned for some failed tests from some of the last changes we made, but we never got the warning! Should be back up and running soon!
Thank you!
Tune XGBoost with tidymodels and #TidyTuesday beach volleyball | Julia Silge
Learn how to tune hyperparameters for an XGBoost classification model to predict wins and losses.
https://juliasilge.com/blog/xgboost-tune-volleyball/