juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

Tuning random forest hyperparameters with #TidyTuesday trees data | Julia Silge #6

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Tuning random forest hyperparameters with #TidyTuesday trees data | Julia Silge

I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using a #TidyTuesday dataset from earlier this year on trees around San Francisco to show how to tune the hyperparameters of a random forest model and then use the final best model.

https://juliasilge.com/blog/sf-trees-random-tuning/

cvaldezerea commented 3 years ago

Hi, is there a way to tune a model using the testing data as well? I mean, should we train the model with the training data and pick the best one using the test set? Is that correct?

juliasilge commented 3 years ago

You can read more about spending your data budget in this chapter. The purpose of the testing data is to estimate performance on new data. To tune a model or pick the best model, you can use resampled data or a validation set, which we like to think of as a single resample.
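
A minimal sketch of what that can look like with rsample (assuming tidymodels is loaded and trees_train is the training data from this post; the proportion is just an example):

set.seed(234)
# hold out 20% of the training data as a single validation resample
val_set <- validation_split(trees_train, prop = 0.8)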

Chihui8199 commented 3 years ago

Hi Julia, from the last step, how can I get the confusion matrix? I can't figure it out!

juliasilge commented 3 years ago

@Chihui8199 You should be able to do this to create a confusion matrix for the test set:

final_res %>%
    collect_predictions() %>%
    conf_mat(legal_status, .pred_class)

michael-hainke commented 3 years ago

Thank you so much for these great blog articles, which have really helped me in working with tidymodels! One question: once I've done my last_fit(), how best to save the model and use it at a later date for predictions on new data? I can't seem to find any good resources on deploying fitted models. Thanks!

Chihui8199 commented 3 years ago

Hey Julia! I never expected that you would respond!!! That was immensely helpful! Enjoyed the guide a lot :)

juliasilge commented 3 years ago

@michael-hainke The output of last_fit() contains a workflow that you can use for prediction on new data. I show how to do that in this post and this post.
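
As a rough sketch, that can look something like this (final_res is from this post; new_trees is a hypothetical data frame of new observations):

# pull the fitted workflow out of the last_fit() result
fitted_wf <- extract_workflow(final_res)

# use it to predict on new data
predict(fitted_wf, new_data = new_trees)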

michael-hainke commented 3 years ago

@juliasilge Thanks for the quick reply, this is great!

Chihui8199 commented 3 years ago

@juliasilge What does grid = 20 under tune_grid() mean, exactly? After reading the documentation I still don't quite understand. Thank you in advance :)

juliasilge commented 3 years ago

@Chihui8199 Setting grid = 20 says to choose 20 parameter sets automatically for the random forest model, based on what we know about random forest models and such. If you want to dig deeper into what's going on, I recommend this chapter of TMwR.
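
When you pass an integer, tune_grid() generates that many candidate parameter sets for you; the alternative is to supply your own data frame of candidates. A sketch with a hand-made grid (the specific values here are only illustrative):

rf_grid <- tidyr::crossing(
  mtry = c(5, 10, 20, 30),
  min_n = c(2, 10, 25, 40)
)

set.seed(345)
tune_res <- tune_grid(
  tune_wf,
  resamples = trees_folds,
  grid = rf_grid   # 16 explicit candidates instead of 20 automatic ones
)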

yerazo3599 commented 3 years ago

Hi Julia, I love your content, it is very helpful.

I am running this code but with my data. I have followed this tutorial, but when I run these lines

set.seed(345)
tune_res <- tune_grid(
  tune_wf,
  resamples = trees_folds,
  grid = 20
)

I get this error message: Error: To tune a model spec, you must preprocess with a formula or recipe. I tried to apply the prep() function but it doesn't work. Could you help me with this, please?

juliasilge commented 3 years ago

@yerazo3599 Take a look at your tune_wf object; it sounds like it does not have a formula or recipe added.
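
For comparison, a minimal sketch of a workflow that does include a preprocessor (the formula and object names follow this post's setup; swap in your own recipe or formula):

tune_wf <- workflow() %>%
  add_formula(legal_status ~ .) %>%   # or add_recipe(your_recipe)
  add_model(tune_spec)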

To get some more detailed help, I recommend laying out a reprex and posting on RStudio Community. Good luck! 🙌

ntihemuka commented 3 years ago

Hey Julia, loved this video very much! I just started learning R; I am doing a master's in data science. I am also considering getting a PhD when I finish. My career aspirations are not to teach but just to be good at machine learning and data science. As someone who has been there, would you recommend getting a PhD or just self-learning with books and resources written by the experts? Also, getting a job in tech that pays bank would be awesome :)

juliasilge commented 3 years ago

Generally, if your goal is to work in data science as a practitioner, I don't think a PhD is the way to go. You might check out https://shouldigetaphd.com/ for some more perspective on this!

ntihemuka commented 3 years ago

Thanks, Julia!

guberney commented 3 years ago

Hello Julia, thanks a lot for the post and the video. Do you recommend adding variable importance for training in the initial tune_spec, or updating the workflow at the final_wf stage?

juliasilge commented 3 years ago

@guberney I believe (you can check this for yourself with your data to see if it makes a difference) that training can be slower when importance is being computed, so you may not want to include it for all tuning iterations.
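
One way to do that, sketched with this post's object names, is to leave importance out of the spec you tune and only add it when finalizing:

final_rf <- finalize_model(tune_spec, best_auc) %>%
  set_engine("ranger", importance = "permutation")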

canlikala commented 3 years ago

Hello Julia, again thank you for your amazing work. How can I use weights in this model?

juliasilge commented 3 years ago

@canlikala tidymodels doesn't currently support case weights, but we are tracking interest in this issue and ideas on implementation here.

kamaulindhardt commented 3 years ago

Hi Julia and everyone,

Hope you had a good weekend. I am working with large geo-referenced point data with biophysical variables (e.g. soil pH, precipitation, etc.) and would like to test for spatial autocorrelation in my data. Is there any tidy-friendly way to test for this particular kind of autocorrelation? Since it is important for my further machine learning regression analysis, I would need to take it into account so as not to produce algorithms/models that are erroneous in predicting the outcome variable.

Thank you

juliasilge commented 3 years ago

@kamaulindhardt A good place to ask a question like this is on RStudio Community. Be sure to create a reprex to show folks what kind of data you are dealing with.

nvelden commented 3 years ago

Is there a specific reason you used roc_auc as a metric for tuning and not accuracy?

juliasilge commented 3 years ago

@nvelden I think it's generally more rare for overall accuracy to be the most useful/appropriate metric for real-world classification problems. Making a metric choice is super connected to your specific problem in its real context. You can check out metric options in tidymodels here.

data-datum commented 2 years ago

Hi Julia, your posts are helping a lot!!! I would like to know: if I have to sample a big dataset to get a representative sample, is there any option available in tidymodels? I thought the rsample package might be a choice, but I do not know much about it. Thanks!

juliasilge commented 2 years ago

@data-datum If you want to subsample your data as part of feature engineering to balance classes, take a look at themis. If you just want to sample down overall, I'd probably use slice_sample() from dplyr.
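
A quick sketch of both options (assuming tidymodels is loaded; trees_df, trees_train, and legal_status follow this post, and the proportion and recipe name are just illustrative):

# sample down the whole dataset before splitting
trees_small <- dplyr::slice_sample(trees_df, prop = 0.1)

# or balance classes as a preprocessing step with themis
library(themis)
rec_balanced <- recipe(legal_status ~ ., data = trees_train) %>%
  step_downsample(legal_status)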

hardin47 commented 2 years ago

I asked the same question at https://juliasilge.com/blog/astronaut-missions-bagging/ (so apologies for bugging you twice), but it feels so much more relevant to this blog.

Why tune mtry using cross-validation instead of out-of-bag information? It seems like OOB tuning could be very useful and helpful!

Thank you for all that you do. Your screencasts are amazing!

juliasilge commented 2 years ago

@hardin47 I think the main reason is that the performance estimates you get if tuning on OOB samples don't always turn out well, and maybe even mtry doesn't get chosen well.

wsteenhu commented 2 years ago

Dear Julia, thank you very much for these screencasts and useful information. I am currently working on a 2-class classification problem with a high number of predictors (~300) and a limited number of samples (150 class 1 / 300 class 2). In line with @hardin47's question, I was considering optimising tuning parameters using the OOB errors instead of CV errors. The paper you refer to seems to support this to a certain degree, especially when the sample size is not extremely small and when using stratified subsampling (to avoid severe class imbalances in in-bag/out-of-bag samples). Of course, tuning parameters using the OOB errors would be beneficial, as I can use more data to build the model. In the literature, too, this seems like a quite well-supported approach, mostly noting that OOB may be overly pessimistic. I know that, on the other hand, {tidymodels} focusses on 'empirical validation' (= CV). Do you have any additional thoughts on this? Would you consider tuning based on OOB errors (is that even possible in {tidymodels}) when the number of samples is limited?

juliasilge commented 2 years ago

@wsteenhu We don't support getting those OOB estimates out super fluently, because we believe it is generally better practice to tune using a nested resampling scheme, but if you want to see whether it works out OK in your particular setting, you might want to check out this article for how you might manually handle some of the objects/approaches involved. This article might also help you extract the bits you want to get at manually.
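
If the ranger OOB estimate itself is what you're after, one way (sketched here for a single set of arbitrary parameter values, not a full tuning workflow) is to fit the engine directly and read it off the underlying ranger object:

rf_fit <- rand_forest(mtry = 10, trees = 1000, min_n = 5) %>%
  set_mode("classification") %>%
  set_engine("ranger") %>%
  fit(legal_status ~ ., data = trees_train)

extract_fit_engine(rf_fit)$prediction.error   # ranger's OOB prediction error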

ArianaSam commented 2 years ago

Hi Julia,
I am working on a multi-class classification problem. In the variable importance step, I would like to plot variable importance for each class to find out whether a variable is more important for discriminating one class from another. I used local.importance = TRUE, but it didn't work.

Thank you

juliasilge commented 2 years ago

@ArianaSam I'm not sure if global variable importance (like what I show here) will get you what you want; it gives a measure of importance for the whole model overall. Instead, you might try making partial dependence profiles. Check out this chapter to learn more about these differences.
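
As a rough sketch of what a partial dependence profile can look like with DALEXtra (this isn't from the post; rf_fit is assumed to be a fitted workflow or parsnip model, trees_train the training data, and the variable is just an example):

library(DALEXtra)

explainer <- explain_tidymodels(
  rf_fit,
  data = dplyr::select(trees_train, -legal_status),
  y = as.integer(trees_train$legal_status == "DPW Maintained")   # numeric outcome for the explainer
)

pdp <- model_profile(explainer, variables = "latitude")
plot(pdp)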

jfmusso commented 2 years ago

Hi Julia. Thanks for doing the videos. Very helpful. Is there anything in tidymodels that would allow me to use alternative metrics (other than the defaults, like AUC) when tuning parameters? I'm working on a binary classification problem, and there are so many good metrics out there, it puzzles me when packages only include one or two possible default metrics. I particularly like using the shortest distance to (0,1) on the ROC curve, or the Youden Index. Both are easy enough to calculate from sensitivity and specificity. Why not throw them all in there?

juliasilge commented 2 years ago

@jfmusso Yes, for sure! You can tune with any of the metrics in the yardstick package by setting a metric_set(); check out this blog post. You can also create your own custom metric and tune with that; check out this blog post and this article.
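
For example, here is a sketch of tuning against several yardstick metrics at once, including Youden's J (objects follow this post):

set.seed(345)
tune_res <- tune_grid(
  tune_wf,
  resamples = trees_folds,
  grid = 20,
  metrics = metric_set(roc_auc, j_index, sens, spec)   # j_index is Youden's J in yardstick
)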

jfmusso commented 2 years ago

Julia, I'm curious as to why you chose to deal with the class imbalance by downsampling? I understand for purposes of this video/article, it's quick and easy, but were there other considerations as well? As you know, there are many methods for dealing with class imbalance in a binary classification problem. My preference is to avoid reducing or tinkering with the data, and instead alter the decision rule (cutoff) for predicting outcomes based on the probabilities generated during training. Instead of using the default "majority rules", I select the probability cutoff that results in the best value for the metric I've chosen to evaluate the model's performance. This, in effect, makes the cutoff akin to a hyperparameter that requires tuning. Yet, packaged algorithms tend to omit this important tool from the model building process.

On top of that, this approach seems to be somewhat controversial among purists in data science who believe that because the decision rule isn't technically a part of the "model," but more of a business decision, it's somehow an inferior approach that a "real" data scientist shouldn't use. I personally think this is a distinction without a difference. In the end, anything that enhances predictive ability can be considered part of the "model." This is particularly true in a case where the true positive rate is no more or less important than the true negative rate. Difficult business judgments about the relative costs of different kinds of mispredictions don't really come into play. But even if they did, why not incorporate the cutpoint into the model tuning process? The business context influences many aspects of model building, not just the cutoff point.

Sorry for ranting. My real question for you is: if I wanted to test alternate cutoffs in the tidymodels world, how would I do it?

juliasilge commented 2 years ago

@jfmusso You can read more about subsampling and what metrics it does/does not tend to improve here, although you may be familiar with all that. We (the tidymodels team) definitely are in favor of understanding how to appropriately handle probabilities in different situations; check out a tidymodels approach for post-processing probabilities (like finding an optimal probability threshold) in probably. You might especially be interested in this vignette.
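
As a sketch, probably's threshold_perf() computes performance across a grid of candidate probability cutoffs (final_res follows this post; the .pred_ column name is an assumption based on this post's outcome classes, so adjust it to your own event class):

library(probably)

final_res %>%
  collect_predictions() %>%
  threshold_perf(
    legal_status,                    # truth
    `.pred_DPW Maintained`,          # probability column for the event class
    thresholds = seq(0.2, 0.8, by = 0.05)
  )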

hardin47 commented 2 years ago

@hardin47 I think the main reason is that the performance estimates you get if tuning on OOB samples don't always turn out well, and maybe even mtry doesn't get chosen well.

But the paper @juliasilge cites says that OOB is great as long as we are using stratified sampling instead of naive subsampling. To me this is more reason to include OOB options in tidymodels.

In line with results reported in the literature [5], the use of stratified subsampling with sampling fractions that are proportional to response class sizes of the training data yielded almost unbiased error rates in most settings with metric predictors. It therefore presents an easy way of reducing the bias in the OOB error. It does not increase the cost of constructing the RF, since unstratified sampling (bootstrap of subsampling) is simply replaced by stratified subsampling.

juliasilge commented 2 years ago

@hardin47 That's a great point, and I'm glad you opened tidymodels/planning#25 to collect thoughts.

Shrey088 commented 2 years ago

Hi Julia! Thank you for explaining this topic so well. For variable importance, I get an error in set_engine(): "! object should have class 'model_spec'". I am unable to identify the reason. Can you please help?

juliasilge commented 2 years ago

@Shrey088 Can you create a reprex (a minimal reproducible example) for this? The goal of a reprex is to make it easier for us to recreate your problem so that we can understand it and/or fix it. A good place to post such a reprex is the tidymodels tag of RStudio Community.

If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:

install.packages("reprex")

Thanks! 🙌

SimonMontfort commented 2 years ago

Hi Julia, Thank you so much for your tutorials. I really dig them!

I have an important question though, based on an observation that I made for your model and also for mine using different data: why is the performance of the training model worse than that of the model for which out-of-sample performance is evaluated? It seems that this is the case for my models with different data as well. Wouldn't we expect the performance on the training data, show_best(regular_res, "roc_auc"), which here is roc_auc = 0.936, to be higher than for final_res %>% collect_metrics(), which is roc_auc = 0.939? Here the performance based on the training data is worse than that of the out-of-sample test set. In my case the same is true. Of course, it also shows that the model does not overfit, but it seems rather counter-intuitive that the performance does not decrease, at least just a little. In my case, performance on the out-of-sample test set is consistently slightly higher for several models with different dependent variables, which may make a paper reviewer suspicious that something with the testing and training data got mixed up. Am I misunderstanding something here? If so, what?

Remark: I would expect to have to provide trees_test to last_fit() instead of trees_split, so that I know I am supplying the test set to the last_fit() function. trees_split leaves me guessing what I really supply. Or am I again not understanding something correctly?

Thank you so much for taking the time to develop the models and to answer the questions. It is really helpful and made machine learning accessible for me.

juliasilge commented 2 years ago

@SimonMontfort One thing to keep in mind is not just the point estimate for a metric such as ROC AUC but also the variance. In this blog post, I don't ever print out the metrics from resampling (although I believe I did in the video), where you can see both the mean and standard error. Then you also need to keep in mind that you are measuring the test data metrics with a smaller proportion of the data set (so again, uncertainty in that measure). I'd be surprised if you see results where you always and significantly see better performance with the testing set than from resampling. Certainly it's not unexpected for one single dataset, especially if it's not enormous and the test set isn't that big. If you want to lay out what you think is happening in more detail, the ML category on RStudio Community is a good place to post for discussion.
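
Concretely, the two things to compare are along these lines (objects follow this post):

collect_metrics(regular_res)   # resampling estimates: mean, n, and std_err for each metric
collect_metrics(final_res)     # single estimates from the held-out test set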

The last_fit() function needs both the training data (to fit) and the testing data (to evaluate), which is why we have the argument as the split, which contains exactly that information. 👍

nspyrison commented 2 years ago

Hi Julia,

These are amazing resources for the community. Thanks for producing them and responding to questions. Have really enjoyed applying tidymodels alongside your blog posts!

izzydi commented 1 year ago

Hi! One question: if we want to save some time, can we skip the step where you tune one more time with "regular_res" and go straight to using select_best(tune_res, "roc_auc")? Thanks!

juliasilge commented 1 year ago

@izzydi Yep, for sure. If you check out some other tuning tutorials like this one you'll see that's possible (and also a more typical approach). In this particular case, the best values using the default ranges were at the end of the parameter space we tried out, especially for min_n. This means that I wanted to extend out the range and try with a regular grid to get a better picture of what the best value would be. Often this would not be a worthwhile thing to really dig in to, though!
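
In code, skipping the regular-grid refinement just means selecting from the first tuning result directly; a sketch using this post's object names:

best_auc <- select_best(tune_res, metric = "roc_auc")
final_rf <- finalize_model(tune_spec, best_auc)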

jlyonsreid commented 1 year ago

Hi Julia, thank you for the really helpful post. I was wondering if it is possible to integrate with kernelSHAP?

juliasilge commented 1 year ago

@jlyonsreid I have never tried that myself, but I recommend that you create a reprex (a minimal reproducible example) showing what you want to do and any problems you run into with it, then posting on RStudio Community. It's a great forum for getting help with these kinds of modeling questions. Good luck! 🙌

olusolaxx commented 1 year ago

Hi Julia, how can I generate other metrics apart from roc_auc and accuracy from

final_res %>% collect_metrics()

Also, the test set was never used to predict outcomes, and nothing was said about bake(). Thanks!

juliasilge commented 1 year ago

@olusolaxx

bnagelson commented 11 months ago

Hi Julia,

I have followed along with your code here. When I run tune_grid(), I receive a message (while the code is executing) that reads: "Creating pre-processing data to finalize unknown parameter: mtry". The tune_grid() function takes at least 9-10 minutes to execute. Is this expected behavior? I encounter the same issue with my own dataset (which is much smaller than the data you use in this example). Thank you!

juliasilge commented 11 months ago

@bnagelson Oh, that doesn't seem right at all to me. It takes ~10 min but then does finish with no errors? Have you tried some of the basic tuning examples? Do they run OK for you? And then if you switch out for random forest, how does that do?