juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

Partial dependence plots with tidymodels and DALEX for #TidyTuesday Mario Kart world records | Julia Silge #32

utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Partial dependence plots with tidymodels and DALEX for #TidyTuesday Mario Kart world records | Julia Silge

Tune a decision tree model to predict whether a Mario Kart world record used a shortcut, and explore partial dependence profiles for the world record times.

https://juliasilge.com/blog/mario-kart/

csetzkorn commented 3 years ago

Thanks for this. Does bootstraps replicate the training data in this case?

juliasilge commented 3 years ago

@csetzkorn Yes, that's right. Since we use mario_train in bootstraps(mario_train, strata = shortcut), the resamples are from the training data.
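For reference, a sketch of that resampling call (the object name here is illustrative); each bootstrap resample is drawn with replacement from the training set only:

```r
library(tidymodels)

set.seed(123)
mario_boots <- bootstraps(mario_train, strata = shortcut)
```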

csetzkorn commented 3 years ago

Thanks for the reply. So does this mean that you replicate data to train on? I guess this cannot bias the model? I always thought that might be a great way of tackling the curse of dimensionality...

juliasilge commented 3 years ago

@csetzkorn I'm not totally clear on your question, but you might check this chapter of our book, and especially pay attention to the resample-to-resample effect; it may be related to your thoughts.

Ji-square commented 3 years ago

Hi Julia, I'm learning a lot. Thanks. Would you like to analyze data from exoplanets?

https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=PS&constraint=default_flag=1

That should be pretty interesting to see. Thanks!

juliasilge commented 3 years ago

@Ji-square That sounds like it might be an interesting Tidy Tuesday dataset; you can suggest it as a possible option here!

juliasilge commented 3 years ago

@jcragy You can visit the GitHub issue and unsubscribe; I don't believe it will unsubscribe you if I do anything like delete your comment. Have a good weekend! 🙌

SebastianBehrens commented 3 years ago

Hi Julia, thank you for your Tidy Tuesdays. I have learnt so much already.

Could you demonstrate how to create your own step functions in one of your next Tidy Tuesday videos?

juliasilge commented 3 years ago

@SebastianBehrens That's an interesting idea! You may have already seen this, but I want to make sure you know about this guide we have posted.

kamaulindhardt commented 2 years ago

Hi Julia,

Can the ggeffects package be used with cases similar to your example here – except I am working on a regression problem with a Random Forest? My dream is to show marginal effect as "predictions generated by a model when one holds the non-focal variables constant and varies the focal variable(s)."

juliasilge commented 2 years ago

@kamaulindhardt I would think so since it uses predict(). Give it a go!

venkatpgi commented 2 years ago

Hi Julia, once again a very useful post of yours. I have a basic doubt: why did (or didn't) you use tree_res instead of final_fitted when creating the explainer? I thought the tree_res object, not final_fitted, has the complete workflow of the training data. Please clarify, because when I try to use tree_res, it throws an error:

```
Warning: Unknown or uninitialised column: `trained`.
Error in !model$trained : invalid argument type
```

Thanks, Venkat

venkatpgi commented 2 years ago

(continuing) Or, for that matter, I thought final_res would be more appropriate... but I get the same error again.

juliasilge commented 2 years ago

@venkatpgi The tree_res object contains tuning results for bootstrap resamples, with metrics and predictions but no fitted workflows. The final_res object contains a finalized (no more tuning parameters), fitted workflow but is not itself a workflow; it is a tibble that also contains predictions and metrics. In this blog post I extract that workflow via final_res$.workflow[[1]], but you can now use the convenience function extract_workflow() instead if you like.
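For example, with the objects from this blog post:

```r
# Two equivalent ways to pull the fitted workflow out of the last_fit() results:
final_fitted <- final_res$.workflow[[1]]
final_fitted <- extract_workflow(final_res)  # newer convenience function
```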

venkatpgi commented 2 years ago

Thanks, Julia, for your prompt reply as usual; now I understand.

I am somewhat stuck again... I was creating a model explainer for a random forest classification model. I used the following code:

Step 1: Created the model spec

```r
rf_tune_spec_full <- rand_forest(mtry = tune(), trees = 1000, min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
```

Step 2: Created the workflow

```r
rf_tune_wf_full <- workflow() %>%
  add_recipe(nupe_rec_rf_XGB) %>%
  add_model(rf_tune_spec_full)
```

Step 3: Tuned hyperparameters using tune_grid()

```r
tune_res <- tune_grid(rf_tune_wf_full, resamples = nupe_train_cv, grid = 20)
```

Step 4: Picked the best AUC

```r
best_auc_rf <- select_best(tune_res, "roc_auc")
```

Step 5: Finalized the model

```r
final_rf_model <- finalize_model(rf_tune_spec_full, best_auc_rf)
```

Step 6: Finalized the workflow with the final model

```r
final_rf_wf <- workflow() %>%
  add_recipe(nupe_rec_rf_XGB) %>%
  add_model(final_rf_model)
```

Step 7: Last fit

```r
final_rf_res <- final_rf_wf %>% last_fit(nupe_split)
```

Step 8: Extracted the workflow (as advised by you)

```r
rf_fitted <- final_rf_res %>% extract_workflow()
```

Step 9: Created the model explainer (nupe_train is my training data frame and "mort_24h" is the binary outcome variable)

```r
library(DALEXtra)

rf_explainer <- explain_tidymodels(
  rf_fitted,
  data = dplyr::select(nupe_train, -mort_24h),
  y = as.integer(nupe_train$mort_24h),
  label = "random forest",
  verbose = FALSE
)
```

Step 10: Created a breakdown object

```r
rf_breakdown <- predict_parts(
  explainer = rf_explainer,
  new_observation = nupe_train[1, ]
)
```

GETTING AN ERROR:

```
Error: Can't subset columns that don't exist. x Column `mort_24h` doesn't exist.
```

I did check the df once again. This variable is in the df. Don't know why I get the error. Any thoughts?

PS: I don't know how to embed code as you have done... apologies for that.

juliasilge commented 2 years ago

@venkatpgi Can you create a reprex (a minimal reproducible example) for this? The goal of a reprex is to make it easier for us to recreate your problem so that we can understand it and/or fix it. Once you have a reprex, it is best to post on a more public forum like RStudio Community so more folks can see and respond to your problem.

If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:

```r
install.packages("reprex")
```
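(A small usage sketch: copy the code that reproduces your problem to the clipboard, then render it.)

```r
library(reprex)
reprex()  # renders the code on your clipboard as a formatted reprex
```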

Thanks! 🙌

guarvid commented 2 years ago

Hi Julia! Thanks for a thorough explanation of PDPs with tidymodels! If you were to include multiple predictors in the plot, how would you go about it? Can you pass more values to the variables argument of the model_profile() function?

juliasilge commented 2 years ago

@guarvid Yep, you can pass in more than one variable to the variables argument. You can also pass in a groups argument as demonstrated here.
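For example (a sketch with hypothetical variable names):

```r
pdp_multi <- model_profile(
  explainer,
  variables = c("var_one", "var_two"),
  groups = "var_three"
)
plot(pdp_multi)
```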

AndersAstrup commented 2 years ago

Dear Julia. Thank you for your many great resources!

I have followed your approach in this post, using an XGBoost model.

I am able to create PDPs for my categorical and some integer predictors. However, I get an error with certain integer predictors. Any ideas why I might get this error for some of the variables?

```r
pdp_21a <- model_profile(explainer, variables = "TC3G21A")
```
```
Error: Can't convert from `TC3G21A` <double> to `TC3G21A` <integer> due to loss of precision.
```

It works fine with a similar predictor; see the summary of the one giving the error, and of another that works fine, below:

```r
summary(train_data$TC3G21A) # Does not work
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    1.00   20.00   25.00   27.37   33.75   80.00

summary(train_data$TC3G21C) # Works
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    1.00   11.00   16.00   18.35   21.00   52.00
```

juliasilge commented 2 years ago

@AndersAstrup Can you create a reprex (a minimal reproducible example) for this? The goal of a reprex is to make it easier for people to recreate your problem so that they can understand it and/or fix it. If you've never heard of a reprex before, you may want to start with the tidyverse.org help page.

Once you have a reprex, I recommend posting on RStudio Community, which is a great forum for getting help with these kinds of modeling questions. Thanks! 🙌

conlelevn commented 2 years ago

@juliasilge Hi Julia, I know that a decision tree does not require data preprocessing, but let's say I don't know that and do some preprocessing of the predictors anyway (like making dummy variables for all factors). Does it affect my model result?

juliasilge commented 2 years ago

@conlelevn You can read about this topic here in Ch. 5 of Feature Engineering and Selection. A very short summary is that using dummy variables with a tree-based model usually gets you the same result, but the model takes longer to train.
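As a minimal sketch of the two approaches (the recipes here are illustrative, reusing mario_train and shortcut from this post):

```r
library(tidymodels)

# Leave factors as-is: a tree can split on factor levels directly.
rec_factors <- recipe(shortcut ~ ., data = mario_train)

# Create dummy variables first: usually a very similar fit, but with
# more columns for the model to consider, so training takes longer.
rec_dummies <- recipe(shortcut ~ ., data = mario_train) %>%
  step_dummy(all_nominal_predictors())
```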

ashenkin commented 1 year ago

I'm killing myself trying to get model_profile to work, but I keep getting the following error:

```
Error in UseMethod("predict") : 
  no applicable method for 'predict' applied to an object of class 
  "c('last_fit', 'resample_results', 'tune_results', 'tbl_df', 'tbl', 'data.frame')"
```

Any thoughts? I've made sure all my packages are updated, etc. Thanks in advance!

juliasilge commented 1 year ago

@ashenkin It looks like you are trying to predict on a last_fit() object, but you want to predict on the fitted workflow that is inside the results of last_fit() (you can get it out with extract_workflow()).
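For example (object names here are hypothetical):

```r
# Predict with the fitted workflow inside the last_fit() results,
# not with the last_fit() object itself:
fitted_wf <- extract_workflow(final_res)
predict(fitted_wf, new_data = test_data)
```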

If that's not enough to help, I recommend that you create a reprex (a minimal reproducible example) for your problem and post on RStudio Community. It's a great forum for getting help with these kinds of modeling questions.

AstridSanna commented 11 months ago

Hi Julia, could you explain why we have to use the training data (i.e. mario_train) with explain_tidymodels(), and not the test data? Also, could you confirm that the residuals calculated through this function, when using mario_train, are computed from the training data?

On a separate note, could you please recommend one of your posts with an example using random forest for regression, if you have one?

Thank you so much!

juliasilge commented 11 months ago

@AstridSanna