juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

Tune random forests for #TidyTuesday IKEA prices | Julia Silge #16

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Tune random forests for #TidyTuesday IKEA prices | Julia Silge

Use tidymodels scaffolding functions for getting started quickly with commonly used models like random forests.

https://juliasilge.com/blog/ikea-prices/

mwilson19 commented 3 years ago

Hi Julia - why did you set grid = 11 in tune_grid()? Thank you!

juliasilge commented 3 years ago

@mwilson19 Setting grid = 11 says to choose 11 parameter sets automatically for the random forest model, based on what we know about random forest models and such. If you want to dig deeper into what's going on, I recommend this chapter of TMwR.
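
For context, here is a rough sketch of what that means in code (ranger_workflow, ikea_folds, and ikea_train are the objects from the post; the explicit grid is only an illustration of a roughly equivalent manual approach):

# let tune_grid() choose 11 candidate parameter sets automatically
ranger_tune <- tune_grid(ranger_workflow, resamples = ikea_folds, grid = 11)

# roughly the same idea, but with an explicit space-filling grid
ranger_grid <- grid_latin_hypercube(
  finalize(mtry(), ikea_train),  # finalize mtry()'s upper range from the data
  min_n(),
  size = 11
)
ranger_tune <- tune_grid(ranger_workflow, resamples = ikea_folds, grid = ranger_grid)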

abjeroen commented 3 years ago

Hi Julia, I keep getting the error below at the ranger_tune step. It's coming from step_clean_levels(), but I'm unable to fix it. I've tried to reinstall all packages, but no luck. Any help?

Error: preprocessor 1/1: Error in UseMethod("prep"): no applicable method for 'prep' applied to an object of class "c('step_clean_levels', 'step')"

juliasilge commented 3 years ago

@jeroenadema do you have an updated installation of textrecipes, which is where step_clean_levels() is from? I would try updating your packages, probably including recipes and textrecipes (maybe force reinstalling those two via install.packages()).
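
If it helps, reinstalling both from CRAN is one line (restart R afterward):

install.packages(c("recipes", "textrecipes"))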

James-8212913 commented 3 years ago

Hi Julia, the blog you have here and the published work at tmwr.org are brilliant references that I have been using as part of my learning journey - thank you.

My question is about random forest models, specifically tune_grid() and the computational expense of running it. My model has 9 predictors over 17,000 observations to make a classification (Y/N). I have used vfold_cv cross-validation as part of the tune_grid, and I am running parallel processing. I have run the code all night and it still hasn't returned a result. Is there a rule of thumb for understanding how computationally expensive the different parameter settings are across the model, resampling, and tuning? I know this is a bit like asking how long a piece of string is, but any advice would be appreciated - a methodical process to step through the parameters to get a model that is 'good enough' without it taking days to run. Many thanks in advance. James

juliasilge commented 3 years ago

@James-8212913 Definitely a tough issue! I imagine you have already found this section, but I want to draw your attention to it if not. One thing to keep in mind with random forests is that often, they don't improve much with tuning, as long as you have enough trees, like > 1e3 or so. In a situation like this, I would start out with no tuning and just fit the model one time to see how long it takes, then fit a model with no tuning parameters to resampled folds to estimate how well it does and see how long it takes, and then scale up from there. Another option to consider is a bagged tree, which can do as well as a random forest or xgboost but fit a lot faster in some situations, because you don't usually tune them.
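
Sketched out, that staged approach might look something like the following (my_recipe, my_train, and my_folds are placeholders for your own objects, not code from this post):

# 1. fit once on the training data to get a feel for the cost
rf_spec <- rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")
rf_wf <- workflow(my_recipe, rf_spec)
system.time(fit(rf_wf, data = my_train))

# 2. still no tuning: estimate performance across resampled folds
rf_res <- fit_resamples(rf_wf, resamples = my_folds)
collect_metrics(rf_res)

# 3. a bagged tree (baguette package) as a faster alternative to tuning
library(baguette)
bag_spec <- bag_tree() %>%
  set_engine("rpart", times = 25) %>%
  set_mode("classification")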

James-8212913 commented 3 years ago

Hi Julia, this is excellent advice - just the steer I was looking for. Thank you for the quick turnaround with a high-quality response. Much appreciated.

castaff commented 3 years ago

Hi Julia, Happy Friday - I have a maybe "philosophical" question related to this. If I am hoping to use cross-validation to build confidence in my model, should the tuning take place first, and then I use those optimum hyperparameters in the model that I fit on resamples? Or would one tune "inside" each of the cross-validation samples? I am not even sure my question makes sense, so any insight would be much appreciated! Thank you.

juliasilge commented 3 years ago

@castaff I recommend that you take a look at our book Tidy Modeling with R, especially the chapters on "A model workflow", "Resampling for evaluating performance", and "Model tuning and the dangers of overfitting". When you tune a model (try out different possible hyperparameters), you fit all the possible candidate models to all the resamples. This means that you have a reliable estimate of performance for each candidate model. You can then choose the most appropriate candidate model and typically fit one more time using the training set as a whole.
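
In code, that last step looks roughly like this sketch (tune_results, tuned_workflow, and my_training_data are placeholder names):

# pick the best hyperparameters from the resampled tuning results
best_params <- select_best(tune_results, metric = "rmse")

# plug them into the workflow and fit one final time on the full training set
final_wf <- finalize_workflow(tuned_workflow, best_params)
final_fit <- fit(final_wf, data = my_training_data)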

kamaulindhardt commented 3 years ago

Dear Julia,

I am running out of options, so I am resorting to asking you here directly. I am struggling to make my column variables unique after they have been passed through the step_dummy() function. This results in the error

"Error: Column names SubPrName.Code__TPM.N, and 35 more must not be duplicated. Use .name_repair to specify repair."

I have posted my issue on Stack Overflow with a reprex included. Any help would be highly appreciated. (https://stackoverflow.com/questions/68222105/step-dummy-dealing-with-duplicated-column-names-generated-by-recipe-steps-ti)

goodyonsen commented 2 years ago

I've just discovered your tutorials on YouTube, which led me here. Both the video and this page are so informative, and very much useful for a rookie like me. You left almost nothing unexplained in the video, so thank you for that as well. I would like to ask your advice on two issues: 1) I was actually looking for multiple linear regression examples with the same libraries and similar boilerplate code. I believe it was use_glmnet() from the usemodels library; its doc page says glmnet was primarily developed for GLMs but can be adapted for lms too. I was working on predicting the actual productivity of a dataset from the garments industry. I tried to apply your approach to the model I worked on like this:

######################################################

#### MAKING RESAMPLES FOR TUNING AND COMPARING MODELS: 
my_seed1 <- 777
set.seed(my_seed1)

#### Bootstrap sampling:
sewing_strap <- bootstraps(sewing_train, strata = actual_productivity)
sewing_strap # 25 resamples with splits ranging from 517(train)/194(test) to 517(train)/207(test)
tail(sewing_strap, 15)

use_glmnet(actual_productivity ~ ., data = sewing_train)

glmnet_recipe <- 
  recipe(formula = actual_productivity ~ ., data = sewing_train) %>% 
  step_zv(all_predictors()) %>% #zero variance filter
  step_normalize(all_predictors(), -all_nominal()) # to normalize var distros

glmnet_spec <- 
  linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet") 

glmnet_workflow <- 
  workflow() %>% 
  add_recipe(glmnet_recipe) %>% 
  add_model(glmnet_spec) 

glmnet_grid <- crossing(penalty = 10^seq(-6, -1, length.out = 20), 
                        mixture = c(0.05, 0.2, 0.4, 0.6, 0.8, 1)) 
glmnet_tune <- 
  tune_grid(glmnet_workflow, resamples = sewing_strap, grid = glmnet_grid) 

show_best(glmnet_tune)
show_best(glmnet_tune, "rsq")

######################################################

I received no errors and the outputs seem OK. I am wondering if glmnet was the right way to fit a multiple linear regression.

2) I used the below lines for skew check and normalization:

recipe(actual_productivity ~ ., data = sew_train) %>%
  step_YeoJohnson(all_numeric()) 
recipe(actual_productivity ~ ., data = sew_test) %>%
  step_YeoJohnson(all_numeric()) 

I don't know if step_YeoJohnson() was the right choice for both purposes. Also, I don't recall you normalizing your data before modeling or splitting - was there a reason for that? Maybe ranger took care of it? I also would like to ask: if normalization is a requirement, is normalizing both the test and train data a common practice?

Thanks for your time in advance... God bless..

juliasilge commented 2 years ago

@goodyonsen You definitely can use glmnet for multiple linear regression, for sure. It does regularization so you have to decide if that is what you want/need. I didn't normalize my data because it's not needed for tree-based models, but a model like glmnet does need you to put all your features on the same scale (because of the regularization). You can check out this page for technical details on glmnet, and this appendix for info on recommended preprocessing by model.
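
To make the train/test part of your question concrete, here is a rough sketch reusing your sewing_train (and assuming a matching sewing_test); a recipe estimates the normalization from the training data only and then applies that same transformation to new data, so you don't normalize each set independently:

glmnet_recipe <-
  recipe(actual_productivity ~ ., data = sewing_train) %>%
  step_zv(all_predictors()) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%  # handle skew
  step_normalize(all_numeric_predictors())       # center and scale

# statistics are estimated from the training set only...
prepped <- prep(glmnet_recipe, training = sewing_train)

# ...and then applied, unchanged, to the test set
baked_test <- bake(prepped, new_data = sewing_test)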

goodyonsen commented 2 years ago

The appendix and the book are useful guidance indeed; I saved them both. Thank you so much, Julia. I'm learning all of this to - hopefully one day - be able to apply it successfully to finance-related ML projects. I have a BA with a major in International Finance, but theoretical knowledge quickly gets outdated as the coding world changes things at the speed of light. So, I was wondering if you could recommend some free online ML tutorials specialized in finance (i.e., quant data and algorithms) for a speedy catch-up. I do my own research, but inexperience on the coding side makes me wonder whether I've got the right material to begin with. That's why I wanted to ask your opinion on this. I appreciate all the help.

juliasilge commented 2 years ago

@goodyonsen If you are looking for a general tutorial on supervised ML, I have put together this interactive course. One person who does more finance oriented tutorials is Matt Dancho, so you might check out his resources.

goodyonsen commented 2 years ago

@juliasilge Thank you...

conlelevn commented 2 years ago

@juliasilge

Hi Julia, when I run these lines of code:

set.seed(8577)
doParallel::registerDoParallel()

ranger_tune <- tune_grid(ranger_workflow, resamples = ikea_folds, grid = 11 )

An error comes up saying: Error in UseMethod("prep"): no applicable method for 'prep' applied to an object of class "c('step_clean_levels', 'step')"...

When I remove step_clean_levels() from the recipe, it runs again. I'm not sure whether this is a bug or not, but please take some time to have a look. Thanks

juliasilge commented 2 years ago

@conlelevn It sounds like you have some versions of packages that aren't working together. I recommend that you update to the current CRAN version of packages, especially recipes and textrecipes.

dyanifrye commented 1 year ago

Hi Julia! Thank you so much for this helpful tutorial! I'm having a strange issue where whenever I make a vip, the top 5 variables are different. I run everything in this code below and it creates a variable importance plot. Then I run the workflow() line again and it creates a different variable importance plot. I expected it to create the same plot over and over again. Could you shed some light on why this might be occurring?

set.seed(815)
pH_split <- initial_split(pH_raw)
pH_train_new <- training(pH_split)
pH_test_new <- testing(pH_split)

rf_mod_final <- rand_forest(mtry = 5, trees = 100, min_n = 2) %>%
  set_mode("regression") %>%
  set_engine("ranger")

pH_rec <- recipe(pH_water ~ ., data = pH_train_new, importance = TRUE)
rf_wf <- workflow(pH_rec, rf_mod_final)
pH_fit <- fit(rf_wf, pH_train_new)

imp_spec <- rf_mod_final %>%
  set_engine("ranger", importance = "permutation")

workflow() %>%
  add_recipe(pH_rec) %>%
  add_model(imp_spec) %>%
  fit(pH_train_new) %>%
  pull_workflow_fit() %>%
  vip(num_features = 5, aesthetics = list(alpha = 0.8, fill = "black"))

juliasilge commented 1 year ago

@dyanifrye Looks like you are using a random forest here, which does involve randomness (even in the name!). If your goal is to have a reproducible model where you always get the same values, then use set.seed() liberally, especially right before you fit the model. If your goal is to understand which variables are really most important across different possible random forest configurations, then I suggest that you use resampling like in this example.
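
For the reproducibility route, a small tweak to your code above would be something like this (extract_fit_parsnip() is the newer name for pull_workflow_fit()):

# set the seed right before fitting so ranger draws the same
# bootstrap samples and candidate splits every time
set.seed(345)
workflow() %>%
  add_recipe(pH_rec) %>%
  add_model(imp_spec) %>%
  fit(pH_train_new) %>%
  extract_fit_parsnip() %>%
  vip(num_features = 5, aesthetics = list(alpha = 0.8, fill = "black"))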

dyanifrye commented 1 year ago

Hi Julia, thanks so much for recommending the resampling approach to visualize the most important variables across more than just a single model configuration. This is the best approach for my situation! Much gratitude.

nikosGeography commented 1 year ago

Hi Julia, I have a similar regression task, in which I have to predict from a coarse spatial resolution raster (400 m pixel size) down to a finer spatial scale (100 m pixel size). I found your tutorial helpful, so I am trying to replicate some of its steps. In your tutorial, the test set already has the response variable in a column, so you can easily calculate the RMSE or MSE. In my case, I do not have the response variable at the fine spatial scale (my goal is to predict it).

My question is: should I split the data set (at the coarse spatial scale) into training and test sets in order to fine-tune a RF model (using the training set) and test its predictive power on the test set by comparing predictions vs. observed data, before I apply it to predict the response at the fine spatial scale? Or should I use the whole data set (without splitting it first) and try to predict the response at the fine spatial scale?

What are your thoughts on that? Does it make sense to split the data set and make predictions at the coarse spatial scale before I move on to a finer spatial scale?

juliasilge commented 1 year ago

@nikosGeography It might be good to double check on this with someone who has a lot of spatial experience (i.e. not me) but based on a typical ML strategy, I would take the data you have (coarse scale) and use a spatial resampling strategy to create folds. I would tune your model using this resampling strategy. For spatial data with autocorrelation, the performance estimate from this approach will likely be more accurate (not too optimistic) than a random test/train split. Once you have a tuned model whose performance you understand, you can then predict at the fine scale.
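
As a very rough sketch of that setup (assuming your coarse-scale data is an sf object and you use the spatialsample package; coarse_sf and rf_workflow are placeholder names):

library(spatialsample)

# spatially clustered folds from the coarse-scale data
set.seed(234)
spatial_folds <- spatial_clustering_cv(coarse_sf, v = 10)

# tune the random forest against those folds
rf_res <- tune_grid(rf_workflow, resamples = spatial_folds, grid = 11)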

nikosGeography commented 1 year ago

Thanks, I wasn't aware of spatial resampling. I will definitely have a look at it.

alicanaza commented 1 year ago

Hi Julia, I have a similar error in my script code: Error in UseMethod("train"): no applicable method for 'train' applied to an object of class "formula". Could you please help me?

juliasilge commented 1 year ago

Hmmmm @alicanaza that's not quite enough to go on. I recommend that you create a reprex (a minimal reproducible example) showing what you want to do and any problems you run into with it, then post on Posit Community. It's a great forum for getting help with these kinds of modeling questions. Good luck! 🙌
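
If you haven't made a reprex before, the reprex package does most of the work; a minimal sketch:

# copy a small, self-contained example (including library() calls) to the
# clipboard, then run:
reprex::reprex()
# the rendered code and output land back on the clipboard, ready to paste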

carlosvblessa commented 12 months ago

Hello Julia, I'm trying to learn how to tune hyperparameters by following your tutorial. When I run:

ranger_tune <- tune_grid(ranger_workflow, resamples = ikea_folds, grid = 11 )

The console displays the following: i Creating pre-processing data to finalize unknown parameter: mtry

And then nothing happens; processor usage stays close to zero.

If I create a grid:

ranger_grid <- grid_latin_hypercube( min_n(), finalize(mtry(), ikea_train), size = 30 )

And I change the tune to:

ranger_tune <- tune_grid(ranger_workflow, resamples = ikea_folds, grid = ranger_grid )

The tuning is ok, but the results are inferior.

collect_metrics(ikea_fit)

A tibble: 2 × 4
  .metric .estimator .estimate .config
1 rmse    standard       0.322 Preprocessor1_Model1
2 rsq     standard       0.746 Preprocessor1_Model1

Would you help me?

Many thanks in advance. Carlos.

juliasilge commented 12 months ago

Hmmmmm @carlosvblessa I have not seen a problem like this. Have you tried some of the basic tuning examples? Do they run OK for you? And then if you switch out for random forest, how does that do?

carlosvblessa commented 11 months ago

Hi Julia,

For some unknown reason, when I don't add parallel processing, the procedure following the tutorial works.

ranger_tune <- tune_grid(ranger_workflow, resamples = ikea_folds, grid = 11 )

i Creating pre-processing data to finalize unknown parameter: mtry

show_best(ranger_tune, metric = "rmse")

A tibble: 5 × 8
  mtry min_n .metric .estimator  mean     n std_err .config
1    2     4 rmse    standard   0.340    25 0.00202 Preprocessor1_Model10
2    4    10 rmse    standard   0.348    25 0.00226 Preprocessor1_Model05
3    5     6 rmse    standard   0.349    25 0.00235 Preprocessor1_Model06
4    3    18 rmse    standard   0.350    25 0.00218 Preprocessor1_Model01
5    2    21 rmse    standard   0.352    25 0.00200 Preprocessor1_Model08

juliasilge commented 11 months ago

@carlosvblessa So you are saying it works without parallel processing and does not work with parallel processing? You might want to read:

One thing to keep in mind is that you must set up parallel processing in a way appropriate for your OS (like Windows vs. Linux vs. macOS).
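
For example, on Windows an explicit PSOCK cluster is the usual approach; a rough sketch:

# register a PSOCK cluster (works on Windows, Linux, and macOS)
cl <- parallel::makePSOCKcluster(parallel::detectCores(logical = FALSE) - 1)
doParallel::registerDoParallel(cl)

# ... run tune_grid() here ...

parallel::stopCluster(cl)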