juliasilge / juliasilge.com

My blog, built with blogdown and Hugo
https://juliasilge.com/

Predict availability in #TidyTuesday water sources with random forest models | Julia Silge #27

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Walk through a tidymodels analysis from beginning to end to predict whether water is available at a water source in Sierra Leone.

https://juliasilge.com/blog/water-sources/

JMaccario commented 3 years ago

I am trying to reproduce the results chunk by chunk (even line by line). Everything went fine until these commands:

doParallel::registerDoParallel()
set.seed(74403)
ranger_rs <- fit_resamples(ranger_workflow, resamples = water_folds, control = control_resamples(save_pred = TRUE))

where I got the following 10 times: Fold10: preprocessor 1/1: Error: could not find function "allnominal... Thanks.

JMaccario commented 3 years ago

sessionInfo() available if necessary/useful

juliasilge commented 3 years ago

@JMaccario It looks like you might need the newest version of recipes from CRAN.
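For anyone hitting a similar "could not find function" error, a quick way to see which versions are installed is sketched below. The helper `is_outdated()` is a hypothetical name introduced here for illustration, and the list of packages is only an assumption about what is relevant to this thread.

```r
# Hypothetical helper: TRUE when the installed version of `pkg` is older than `minimum`.
is_outdated <- function(pkg, minimum) {
  utils::packageVersion(pkg) < package_version(minimum)
}

# Report installed versions of the packages discussed in this thread,
# skipping any that are not installed on this machine.
pkgs <- c("recipes", "tune", "workflows")
installed <- pkgs[pkgs %in% rownames(utils::installed.packages())]
vapply(installed, function(p) as.character(utils::packageVersion(p)), character(1))

# install.packages("recipes")  # uncomment to pull the current CRAN release
```

Restarting R after updating is usually a good idea, since an already loaded namespace keeps the old version.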

JMaccario commented 3 years ago

I updated the libraries, everything OK now

JoDudding commented 3 years ago

That's really helpful thanks Julia.

How would you save this workflow so you could apply it to new data? If I try something like this, I get a message saying the workflow has not yet been trained:

model_to_save <- final_fitted$.workflow[[1]]

predict(model_to_save, water_test %>% select(-status_id) %>% slice(1))

juliasilge commented 3 years ago

Hmmmm, can you update your packages @JoDudding? I am guessing you have an older version of tune or workflows.
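With reasonably recent versions of tune and workflows, one possible approach is to pull the trained workflow out of the `last_fit()` result with `extract_workflow()` and write it to disk with `saveRDS()`. The sketch below is self-contained, so it swaps in a linear model on `mtcars` for the random forest and water data from the post; the object names are illustrative assumptions.

```r
library(tidymodels)

set.seed(123)
car_split <- initial_split(mtcars)

# Stand-in workflow: a linear regression instead of the post's random forest
car_wf <- workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg())

# Fit on the training set, evaluate on the test set
final_fitted <- last_fit(car_wf, car_split)

# Extract the trained workflow and save it
fitted_wf <- extract_workflow(final_fitted)
path <- tempfile(fileext = ".rds")
saveRDS(fitted_wf, path)

# Later (or in another session): reload and predict on new data
reloaded <- readRDS(path)
new_preds <- predict(reloaded, new_data = testing(car_split)[1, ])
new_preds
```

The reloaded object carries the preprocessing along with the model, so new data only needs the original predictor columns.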

JoDudding commented 3 years ago

Looks like they're slightly behind: workflows_0.2.1 and tune_0.1.1. Unfortunately, updating the packages is not easy with the setup at work, so I'll see how I get on. Thanks for your help.

ggpinto commented 3 years ago

Hi Julia, thanks for another great video!

Have you tried the iml package for variable importance and partial dependence plots? I would just like to know whether there is any clear advantage to using the vip package. Thanks!

juliasilge commented 3 years ago

The two packages I have used the most for model explainability are DALEX and vip. I like DALEX a lot because it has really thorough support for a lot of model-agnostic methods (and great support for tidymodels! 🤩), and I often turn to vip for model-specific methods (like the variable importance that you can get out of tree-based models, etc). I have less experience with iml.

ggpinto commented 3 years ago

Thanks again! I’ll play a bit with DALEX

AttiqUrRehmann commented 3 years ago

That's really helpful, thanks for sharing. Question: I have an unbalanced response variable with 2 categories, that is, 160 ones and 90 zeros. ranger has an "inbag" argument that can be used to balance the data through stratification, but I don't know how to use stratification in ranger. Could you help with that? Here is some code that samples 85 zeros and 85 ones for each tree:

ntree <- 5000

inbag <- replicate(ntree, {
  bvar <- numeric(nrow(train_data))
  indx <- c(sample(which(train_data$target == 0), 85, replace = FALSE),
            sample(which(train_data$target == 1), 85, replace = FALSE))
  bvar[indx] <- 1
  bvar
}, simplify = FALSE)

I was wondering if you know any efficient way to do that.

juliasilge commented 3 years ago

@AttiqUrRehmann You can check out the various subsampling algorithms available in tidymodels and how they work here.
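As a rough illustration of the tidymodels route, the sketch below uses `step_downsample()` from the themis package on made-up data with the 160 / 90 split from the question. The data frame and column names are assumptions; this is a recipe-level alternative to building `inbag` by hand, not a reproduction of ranger's per-tree stratification.

```r
library(tidymodels)
library(themis)

set.seed(123)
# Made-up data with the 160 / 90 class split from the question
train_data <- tibble(
  target = factor(c(rep(1, 160), rep(0, 90))),
  x = rnorm(250)
)

rec <- recipe(target ~ x, data = train_data) %>%
  step_downsample(target)

prepped <- prep(rec)

# Training data as the model would see it: the majority class is cut down
# to the size of the minority class
balanced <- bake(prepped, new_data = NULL)
count(balanced, target)

# New data passes through untouched because the step has skip = TRUE by default
untouched <- bake(prepped, new_data = train_data)
count(untouched, target)
```

Inside a workflow, the same recipe is applied to each analysis set during resampling, so the balancing is redone for every fold.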

data-datum commented 2 years ago

Thanks for this example, Julia! I would like to know why the dataset remains unbalanced after step_downsample. Thank you so much!

juliasilge commented 2 years ago

@data-datum Do you mean the final results using the test set? We don't ever want to up/downsample the test set (or assessment sets during resampling) because we need to compute metrics on the dataset as we would see it "in the wild". Only the training set is up/downsampled; tidymodels functions take care of this subtlety for you. To learn more about this, check out:

oostopitre commented 2 years ago

Thank you for the great tutorial. Is there a special reason why you fit the model again when plotting the variable importances? Could the final_fitted object be used directly if we set the importance parameter when training it originally?

juliasilge commented 2 years ago

@oostopitre Yep, that's right. It is slower to train a random forest model if you are also computing importance scores, so generally you don't want to do that when you are fitting on resamples. Maybe I could have updated the model specification to use importance = "permutation" before I did last_fit(), but I definitely don't want that when I am doing resampling (lots of fits).
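A minimal sketch of that idea: resample with a plain specification, then switch on `importance = "permutation"` only for the final fit. It uses the `two_class_dat` example data from modeldata rather than the water sources data, and all object names are illustrative assumptions.

```r
library(tidymodels)
library(vip)

data(two_class_dat, package = "modeldata")

set.seed(123)
dat_split <- initial_split(two_class_dat, strata = Class)
dat_folds <- vfold_cv(training(dat_split), v = 5, strata = Class)

# Plain specification for resampling: no importance scores, so fits are faster
ranger_spec <- rand_forest(trees = 200) %>%
  set_mode("classification") %>%
  set_engine("ranger")

ranger_wf <- workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(ranger_spec)

ranger_rs <- fit_resamples(ranger_wf, resamples = dat_folds)

# Only the single final fit computes permutation importance
imp_spec <- ranger_spec %>%
  set_engine("ranger", importance = "permutation")

final_fitted <- ranger_wf %>%
  update_model(imp_spec) %>%
  last_fit(dat_split)

fitted_model <- final_fitted %>%
  extract_workflow() %>%
  extract_fit_parsnip()

importance_tbl <- vi(fitted_model)  # tibble of variables and scores
vip(fitted_model)                   # default importance plot
```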

mesdi commented 1 year ago

Can we apply the same permutation-based variable importance method with the boxplot chart as well?

juliasilge commented 1 year ago

@mesdi I don't believe there is a boxplot in this blog post, but you should be able to plot however you prefer once you have the importance scores.

mesdi commented 1 year ago

I mean the boxplot geom, as described below:

ranger_spec %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(status_id ~ ., data = imp_data) %>%
  vip(geom = "boxplot")

(The code chunk doesn't work) Anyway, thanks for the tips.

juliasilge commented 1 year ago

@mesdi If you are having trouble still, can you create a reprex (a minimal reproducible example) for this? The goal of a reprex is to make it easier for people to recreate your problem so that we can understand it and/or fix it. If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. Once you have a reprex, I recommend posting on RStudio Community, which is a great forum for getting help with these kinds of modeling questions. Thanks! 🙌

LukeSalvato commented 1 year ago

Hi Julia,

I have built a random forest model very similar to your example here. However, I would like to show my VIP with a boxplot or violin plot rather than points, and I can't get this to work. Do you have any suggestions? I noticed @mesdi asked the same question, so I posted a Stack Overflow question here too: https://stackoverflow.com/questions/73576720/permutation-based-variable-importance-violin-plots-for-random-forest-in-tidy-m

Thank you!

PS: thanks for all the videos, it's great to watch how you think/code.

juliasilge commented 1 year ago

@LukeSalvato I looked into this a bit more and I think something is not working quite right with type = "permutation" from vip for tidymodels. Over on your Stack Overflow question, I outlined some possible options.
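One workaround that sidesteps vip's `geom = "boxplot"` entirely (and is not necessarily among the options in the Stack Overflow answer) is to refit ranger several times with different seeds, collect its built-in permutation importance, and draw the boxplot with ggplot2. Note that this shows variability across refits rather than across permutations within one fit. The sketch uses `iris` as stand-in data.

```r
library(ranger)
library(ggplot2)

# Refit the forest with different seeds and collect permutation importance each time
imp_runs <- lapply(1:20, function(i) {
  fit <- ranger(
    Species ~ .,
    data = iris,
    num.trees = 200,
    importance = "permutation",
    seed = i
  )
  data.frame(
    run = i,
    variable = names(fit$variable.importance),
    importance = unname(fit$variable.importance)
  )
})
imp_df <- do.call(rbind, imp_runs)

# One box per predictor, ordered by median importance
imp_plot <- ggplot(imp_df, aes(x = importance, y = reorder(variable, importance, FUN = median))) +
  geom_boxplot() +
  labs(x = "Permutation importance", y = NULL)
imp_plot
```

Swapping `geom_boxplot()` for `geom_violin()` gives the violin version.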

GiorgiaMori commented 1 year ago

In here:

collect_predictions(ranger_rs) %>%
  group_by(id) %>%
  roc_curve(status_id, .pred_n) %>%
  autoplot()

can you explain why you use .pred_n and not .pred_y?

I have been learning how to use tidymodels and I don't understand why when I plot with .pred_y I get an inverted curve. BTW, your tutorials are great!

juliasilge commented 1 year ago

In R, factor levels are ordered alphabetically by default, which means that "n" comes first before "y" and is considered the level of interest or positive case. You can change this by manually setting the levels of the status_id factor or by following these startup instructions:

library(yardstick)
#> For binary classification, the first factor level is assumed to be the event.
#> Use the argument `event_level = "second"` to alter this as needed.

Created on 2023-02-27 with reprex v2.0.2

Read more in the docs to get a handle on how that works.
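A small sketch of how the event level interacts with the probability column, using made-up predictions (the values below are assumptions for illustration only). Passing `.pred_n` with the default event level and passing `.pred_y` with `event_level = "second"` describe the same ROC problem from two sides; passing `.pred_y` without changing the event level is what produces the inverted curve.

```r
library(yardstick)

# Made-up predictions; "n" is the first factor level, as with alphabetical ordering
preds <- data.frame(
  status_id = factor(c("y", "n", "y", "y", "n", "n", "y", "n"), levels = c("n", "y")),
  .pred_y = c(0.9, 0.2, 0.8, 0.6, 0.3, 0.4, 0.7, 0.1)
)
preds$.pred_n <- 1 - preds$.pred_y

# Default: the first level ("n") is the event, so use its probability column
auc_first <- roc_auc(preds, status_id, .pred_n)

# Treat "y" as the event instead and pass the matching probability column
auc_second <- roc_auc(preds, status_id, .pred_y, event_level = "second")

# Mismatch: "n" is still the event but the "y" probabilities are supplied,
# which flips the curve
auc_mismatch <- roc_auc(preds, status_id, .pred_y)

auc_first
auc_second
auc_mismatch
```

The same `event_level` argument is accepted by `roc_curve()`, so the plot in the post could be drawn for "y" by pairing `.pred_y` with `event_level = "second"`.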