juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

Tune xgboost models with early stopping to predict shelter animal status | Julia Silge #43

Open utterances-bot opened 2 years ago

utterances-bot commented 2 years ago

Tune xgboost models with early stopping to predict shelter animal status | Julia Silge

Early stopping can keep an xgboost model from overfitting.

https://juliasilge.com/blog/shelter-animals/

luisdominguezromero commented 2 years ago

Hi Julia, Thank you for this video. Very helpful! I was wondering if there was a way to see what specific arguments are available when declaring a computational engine. I was trying to find some info about it in the Tidymodels website but I couldn't find anything. Thank you!

gunnergalactico commented 2 years ago

Hello Dr. Silge, thanks for the analysis. Are you by chance using the dev version of parsnip? I keep getting an error “could not find function stop_iter” even when running your code as is.

Thanks.

juliasilge commented 2 years ago

@luisdominguezromero We recently revamped the parsnip documentation to try to surface this information better. For example, take a look at the main landing page for boost_tree(), which has links for the different engines.
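As a quick way to get at the same information from the console (a sketch, assuming parsnip is installed), something like this lists the engines and opens the engine-specific docs, which include the available engine arguments:

```r
library(parsnip)

# List the available engines (and modes) for boosted trees
show_engines("boost_tree")

# Open the engine-specific documentation for xgboost,
# including its engine arguments
?details_boost_tree_xgboost
```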

juliasilge commented 2 years ago

@gunnergalactico Ah, you don't need anything but CRAN parsnip, but you do need the GitHub version of dials for the stop_iter() parameter. Sorry about that!! I have got to start adding session info to my blog posts. 🙈
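At the time, installing the development version of dials looked something like this (stop_iter() later landed in a CRAN release of dials, so this workaround is no longer needed):

```r
# Install the development version of dials from GitHub
# (requires the remotes package)
remotes::install_github("tidymodels/dials")
```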

AleLustosa commented 2 years ago

Hi Julia, If I want to predict on a test set (a dataset on Kaggle) after last_fit(), how is stopping_fit used? Is stopping_fit the object that should be saved to ".rds"? Thank you very much

juliasilge commented 2 years ago

@AleLustosa The object you would want to use for predicting on new data is extract_workflow(stopping_fit) (that is a fitted workflow), so you could store that as something like stopping_fitted_wf and then save to .rds.
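Concretely, that could look something like this (stopping_fit is the last_fit() result from the post; the file name and kaggle_test data frame are just placeholders):

```r
library(tune)  # for extract_workflow()

# Pull the fitted workflow out of the last_fit() result
stopping_fitted_wf <- extract_workflow(stopping_fit)
saveRDS(stopping_fitted_wf, "stopping_fitted_wf.rds")

# Later, e.g. for a Kaggle test set:
stopping_fitted_wf <- readRDS("stopping_fitted_wf.rds")
predict(stopping_fitted_wf, new_data = kaggle_test)
```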

SewerynGrodny commented 2 years ago

Hi Julia, thanks for your tutorials, they are very helpful. I want to add step_holiday() to a recipe (for Christmas) and then add step_lag() based on this holiday predictor. From my retail experience, people usually organize presents a week or two before the holiday. How would you do this with the recipes package? (basically, lag the Christmas variable)

I could do this with normal data transformation, but I wonder whether it's possible to manipulate variables that were created during the recipe (in the step pipeline).

Thanks in advance, Sewe

juliasilge commented 2 years ago

I believe you should be able to use step_lag() with any new variables you create from step_holiday(). If you end up having trouble, I recommend that you create a reprex (a minimal reproducible example) showing what problems you run into. The goal of a reprex is to make it easier for us to recreate your problem so that we can understand it and/or fix it. A good place to ask questions like that is RStudio Community.
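A rough sketch of what that recipe might look like (the sales, date, and train_data names are made up for illustration; step_holiday() typically names its output after the variable and holiday, e.g. date_ChristmasDay):

```r
library(recipes)

holiday_rec <- recipe(sales ~ date, data = train_data) |>
  # Creates an indicator column, typically named date_ChristmasDay
  step_holiday(date, holidays = "ChristmasDay") |>
  # Lag that indicator by 7 and 14 days to capture pre-holiday shopping
  step_lag(date_ChristmasDay, lag = c(7, 14))
```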

youngjin-lee commented 2 years ago

Hi Julia, Thanks for sharing the tutorial. Could you explain why you chose the best parameters based on "mn_log_loss", but evaluated the model performance in terms of "accuracy" and "roc_auc"?

juliasilge commented 2 years ago

@youngjin-lee No particular reason; you can pass in a custom metric set to last_fit() with the metrics argument to set which metrics to use for the testing set.
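For example, something along these lines (the workflow and split object names are placeholders):

```r
library(tune)
library(yardstick)

final_fit <- last_fit(
  xgb_wf,
  shelter_split,
  metrics = metric_set(mn_log_loss, accuracy, roc_auc)
)
collect_metrics(final_fit)
```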

eryn-carleton commented 2 years ago

Hi Julia,

A few questions for you:

-Is it possible to plot the tree itself with tidymodels?

-I'm trying to use the vip package to get the variable importance scores, but running into this error with the vi function:

Error in eval(stats::getCall(object)$data) : object 'x' not found

However, the plot itself functions just fine. Have you run into this at all?

-Due to a peculiar circumstance, I don't need to split my data into training and testing. Do you have any advice on how to train the model without splitting?

Thanks so much for all you contribute to the R community! Tidymodels and your tutorials have been a huge help for me!

juliasilge commented 2 years ago

fdeoliveirag commented 2 years ago

Julia, great contribution as always!

A question: would the 0.8/0.2 proportion used for early stopping follow the stratification defined in the data split/CV?

Thanks in advance

juliasilge commented 2 years ago

@fdeoliveirag No, that is just a random split. You maybe could pass your own validation data (perhaps created via validation_split()) as the xgboost watchlist argument (which would be a boost_tree() engine argument)? I haven't tried that out, I don't think.

If a stratified internal validation set is something you are interested in, you might open an issue on parsnip outlining your use case.
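For context, that internal random holdout is controlled by the validation engine argument, which is not stratified (a sketch; see the xgboost engine docs in parsnip for details):

```r
library(parsnip)

boost_tree() |>
  # validation = 0.2 holds out a random (not stratified) 20%
  # of the training data for early stopping
  set_engine("xgboost", validation = 0.2) |>
  set_mode("classification")
```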

conlelevn commented 1 year ago

@juliasilge Thanks for another great screencast. I feel a little bit confused about the concepts of steps and iterations; could you please explain them for me, or recommend any material to read about this?

juliasilge commented 1 year ago

@conlelevn I'm not quite sure what you're asking. Do you mean how early stopping works (in the context of boosting)? I think Wikipedia is nice on this, and it has a little section specifically on early stopping in boosting.
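In boosting terms, each iteration adds one more tree to the ensemble; in the post's model spec, trees sets the maximum number of iterations and stop_iter is the early stopping patience (a sketch with illustrative values):

```r
library(parsnip)

boost_tree(
  trees = 500,    # maximum number of boosting iterations; each adds one tree
  stop_iter = 10  # stop early if the holdout metric stalls for 10 iterations
)
```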

jlecornu3 commented 5 months ago

Hi @juliasilge - Trying to use the xgboost engine in tidymodels, how can I get around my date column needing to be in date format when I create the expanding/sliding window validation folds, but then needing to be numeric when I come to the xgboost fit?

factor_sliding_folds <- rsample::sliding_period(
  train_set |> arrange(date),
  index = date,
  period = "quarter",
  lookback = Inf,
  skip = 4,
  assess_stop = 1,
  complete = FALSE
)

juliasilge commented 5 months ago

@jlecornu3 I believe you'll want to use some feature engineering like step_date() to build numeric features for xgboost from your date variable.
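A sketch of that kind of feature engineering (the outcome name and feature choices are placeholders):

```r
library(recipes)

date_rec <- recipe(outcome ~ ., data = train_set) |>
  # Turn the date into year/month/day-of-week features
  step_date(date, features = c("year", "month", "dow")) |>
  # Month and day-of-week come out as factors; make them numeric dummies
  step_dummy(all_nominal_predictors()) |>
  # Drop the raw date column itself, which xgboost can't use
  step_rm(date)
```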