I am trying to reproduce the results chunk by chunk (even line by line). Everything went fine until these commands:
doParallel::registerDoParallel()
set.seed(74403)
ranger_rs <- fit_resamples(ranger_workflow, resamples = water_folds, control = control_resamples(save_pred = TRUE))
where I got this error 10 times:
Fold10: preprocessor 1/1: Error: could not find function "allnominal...
Thanks.
sessionInfo() available if necessary/useful
@JMaccario It looks like you might need the newest version of recipes from CRAN.
I updated the libraries, everything OK now
That's really helpful thanks Julia.
How would you save this workflow so you could apply it to new data? If I try something like this, I get a message saying the workflow has not yet been trained:
model_to_save <- final_fitted$.workflow[[1]]
predict(model_to_save, water_test %>% select(-status_id) %>% slice(1))
Hmmmm, can you update your packages @JoDudding? I am guessing you have an older version of tune or workflows.
Looks like they're slightly behind
workflows_0.2.1
tune_0.1.1
Unfortunately, updating the packages is not easy with the setup at work, so I'll see how I get on. Thanks for your help.
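For readers on recent package versions, here is a minimal sketch of one way to save a trained workflow and reuse it on new data. It is not code from the original post: the toy data, `toy_split`, and the logistic regression model are hypothetical stand-ins, and it assumes a version of tune recent enough to provide `extract_workflow()` for `last_fit()` results.

```r
library(tidymodels)

set.seed(123)
# Hypothetical toy data standing in for the water sources data
toy <- tibble(
  status_id = factor(sample(c("n", "y"), 200, replace = TRUE)),
  x = rnorm(200)
)
toy_split <- initial_split(toy, strata = status_id)

# A simple workflow; the blog post uses a ranger random forest instead
wf <- workflow() %>%
  add_formula(status_id ~ x) %>%
  add_model(logistic_reg())

final_fitted <- last_fit(wf, toy_split)

# Pull out the trained workflow, write it to disk, and read it back
model_to_save <- extract_workflow(final_fitted)
path <- tempfile(fileext = ".rds")
saveRDS(model_to_save, path)
reloaded <- readRDS(path)

# Predict on a single new row, without the outcome column
new_row <- testing(toy_split) %>% select(-status_id) %>% slice(1)
preds <- predict(reloaded, new_row)
```

On older versions, `final_fitted$.workflow[[1]]` is the equivalent accessor, but as the reply above suggests, it may only hold a trained workflow once tune and workflows are up to date.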
Hi Julia, thanks for another great video!
Have you tried the iml package for variable importance and partial dependence plots? I would just like to know whether there is any clear advantage to using the vip package. Thanks!
The two packages I have used the most for model explainability are DALEX and vip. I like DALEX a lot because it has really thorough support for a lot of model-agnostic methods (and great support for tidymodels! 🤩), and I often turn to vip for model-specific methods (like the variable importance that you can get out of tree-based models, etc). I have less experience with iml.
Thanks again! I’ll play a bit with DALEX
That's really helpful, thanks for sharing. Question: I have an unbalanced response variable with 2 categories: 160 observations of 1 and 90 of 0. ranger has an "inbag" argument that can be used to balance the data through stratified sampling, but I don't know how to use it for that. Could you help? Here is some code that takes 85 observations from each class:
ntree <- 5000
inbag <- replicate(ntree, {
  bvar <- numeric(nrow(train_data))
  indx <- c(sample(which(train_data$target == 0), 85, replace = FALSE),
            sample(which(train_data$target == 1), 85, replace = FALSE))
  bvar[indx] <- 1
  bvar
}, simplify = FALSE)
I was wondering if you know any efficient way to do that.
@AttiqUrRehmann You can check out the various subsampling algorithms available in tidymodels and how they work here.
Thanks for this example Julia! I would like to know why after the step_downsample the dataset remains unbalanced? Thank you so much!
@data-datum Do you mean the final results using the test set? We don't ever want to up/downsample the test set (or assessment sets during resampling) because we need to compute metrics on the dataset as we would see it "in the wild". Only the training set is up/downsampled; tidymodels functions take care of this subtlety for you. To learn more about this, check out:
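As a small sketch of that behavior (not from the original post, and using hypothetical toy data): `step_downsample()` from themis has `skip = TRUE` by default, so it is applied when the recipe is prepped on the training data but skipped when the recipe is baked on new data.

```r
library(recipes)
library(themis)

set.seed(123)
# Hypothetical imbalanced data: 160 "y" and 40 "n"
toy <- data.frame(
  status_id = factor(c(rep("y", 160), rep("n", 40))),
  x = rnorm(200)
)

rec <- recipe(status_id ~ x, data = toy)
rec <- step_downsample(rec, status_id)
prepped <- prep(rec)

# Training data as the model sees it: downsampling has been applied
train_baked <- bake(prepped, new_data = NULL)

# Any new data (here the same rows, treated as new): the step is skipped
new_baked <- bake(prepped, new_data = toy)

table(train_baked$status_id)
table(new_baked$status_id)
```

The first table should be balanced, while the second keeps the original class counts, which is why metrics computed on a test set still reflect the unbalanced data.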
Thank you for the great tutorial. Is there a special reason why you are fitting the model again when plotting the variable importance? Can the final_fitted object be used directly if we set the importance parameter when training it originally?
@oostopitre Yep, that's right. It is slower to train a random forest model if you are also computing importance scores, so generally you don't want to do that when you are fitting on resamples. Maybe I could have updated the model specification to use importance = "permutation" before I did last_fit(), but I definitely don't want that when I am doing resampling (lots of fits).
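A minimal sketch of that idea, assuming hypothetical toy data rather than the water sources data: keep a plain specification for resampling, and make a second copy with `importance = "permutation"` only for the final fit. The names `toy`, `imp_spec`, and `imp_fit` are made up for illustration.

```r
library(parsnip)
library(vip)

set.seed(123)
# Hypothetical toy data: x1 is related to the outcome, x2 is noise
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$status_id <- factor(ifelse(toy$x1 + rnorm(200, sd = 0.5) > 0, "y", "n"))

# Specification used for resampling: no importance scores, so fits are faster
ranger_spec <- rand_forest(trees = 100) %>%
  set_mode("classification") %>%
  set_engine("ranger")

# Same specification, with importance switched on for the one final fit
imp_spec <- ranger_spec %>%
  set_engine("ranger", importance = "permutation")

imp_fit <- fit(imp_spec, status_id ~ ., data = toy)

# Importance scores as a tibble; vip(imp_fit) would plot them
imp_scores <- vi(imp_fit)
```

With a workflow, the updated specification could probably be swapped in with `update_model()` from workflows before calling `last_fit()`, though that is not shown in the post.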
Can we apply the same permutation-based variable importance method with the boxplot chart as well?
@mesdi I don't believe there is a boxplot in this blog post, but you should be able to plot however you prefer once you have the importance scores.
I mean the boxplot geom, as in the code below:
ranger_spec %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(status_id ~ ., data = imp_data) %>%
  vip(geom = "boxplot")
(The code chunk doesn't work) Anyway, thanks for the tips.
@mesdi If you are having trouble still, can you create a reprex (a minimal reproducible example) for this? The goal of a reprex is to make it easier for people to recreate your problem so that we can understand it and/or fix it. If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. Once you have a reprex, I recommend posting on RStudio Community, which is a great forum for getting help with these kinds of modeling questions. Thanks! 🙌
Hi Julia,
I have built a random forest model very similar to your example here. However, I would like to show my variable importance with a boxplot or violin plot rather than points, and I can't get this to work. Do you have any suggestions? I noticed @mesdi asked the same question, so I posted a Stack Overflow question here too: https://stackoverflow.com/questions/73576720/permutation-based-variable-importance-violin-plots-for-random-forest-in-tidy-m
Thank you!
PS thanks for all the videos, it's great to watch how you think/code.
@LukeSalvato I looked into this a bit more and I think something is not working quite right with type = "permutation" from vip for tidymodels. Over on your Stack Overflow question, I outlined some possible options.
In here:
collect_predictions(ranger_rs) %>%
group_by(id) %>%
roc_curve(status_id, .pred_n) %>%
autoplot()
can you explain why you use .pred_n and not .pred_y?
I have been learning how to use tidymodels and I don't understand why I get an inverted curve when I plot with .pred_y.
BTW, your tutorials are great!
In R, factor levels are ordered alphabetically by default, which means that "n" comes before "y" and is considered the level of interest or positive case. You can change this by manually setting the levels of the status_id factor or by following the instructions in this startup message:
library(yardstick)
#> For binary classification, the first factor level is assumed to be the event.
#> Use the argument `event_level = "second"` to alter this as needed.
Created on 2023-02-27 with reprex v2.0.2
Read more in the docs to get a handle on how that works.
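To make the inverted curve concrete, here is a small sketch with hypothetical toy predictions (not the model from the post). It compares the default first-level event with `event_level = "second"`, and shows what happens when the probability column does not match the event level.

```r
library(yardstick)
library(dplyr)

set.seed(123)
# Hypothetical predictions; factor levels are "n" then "y"
toy_preds <- tibble(
  status_id = factor(rep(c("n", "y"), each = 50), levels = c("n", "y")),
  # true "y" rows tend to get a higher .pred_y, so the model beats random
  .pred_y = c(runif(50, 0, 0.6), runif(50, 0.4, 1)),
  .pred_n = 1 - .pred_y
)

# Default: the first level ("n") is the event, so pass .pred_n
auc_n <- roc_auc(toy_preds, status_id, .pred_n)

# Treat the second level ("y") as the event and pass .pred_y
auc_y <- roc_auc(toy_preds, status_id, .pred_y, event_level = "second")

# Mismatch: "n" is still the event but .pred_y is supplied, so the
# curve is inverted and the AUC falls below 0.5
auc_mismatch <- roc_auc(toy_preds, status_id, .pred_y)
```

The same `event_level` argument can be passed to `roc_curve()` before `autoplot()`.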
Predict availability in #TidyTuesday water sources with random forest models | Julia Silge
Walk through a tidymodels analysis from beginning to end to predict whether water is available at a water source in Sierra Leone.
https://juliasilge.com/blog/water-sources/