juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

Predict ratings for #TidyTuesday board games | Julia Silge #60

utterances-bot opened 2 years ago

utterances-bot commented 2 years ago

Predict ratings for #TidyTuesday board games | Julia Silge

A data science blog

https://juliasilge.com/blog/board-games/

nguyenbui2k commented 2 years ago

Hi, thanks for this great tutorial. I don't know why my code failed with this error: "Error: There were no valid metrics for the ANOVA model. Run rlang::last_error() to see where the error occurred."

I am pretty sure that I ran the same code as your tutorial.

rserran commented 2 years ago

Hi, Julia!

I was curious about the possibility of adding the 'boardgamemechanic' feature that you mentioned. Adding it further reduces the RMSE with the same tuning parameters. Here are the console output and plots:

What do you think?

juliasilge commented 2 years ago

@nguyenbui2k Hmmmm, I would double check that you haven't missed any arguments or anything. Also, you could see if you need to update any packages from very old versions.

@rserran Oh, that looks fantastic! Super interesting that people (the raters, at least) do NOT like roll, spin, and move. 😅

pgomba commented 2 years ago

Hi Julia, thanks for another useful video & post. I have a question about the step update_role(name, new_role = "id"), which I did not know was a thing until today.

For a model in classification (rather than regression) mode, do you think it's possible, once the model is fitted on the testing data, to extract all the individuals that have been wrongly classified along with their ID data? I'm asking because I'm curious to learn more about the reason some of my samples are misclassified.

Thanks!

juliasilge commented 2 years ago

Yes @pgomba absolutely; that kind of use case is exactly why we support roles that are not predictor/outcome in tidymodels. I don't think I have an in-depth example anywhere right now using them after the fact but you can check out:

Oh wait, this blog post shows how to use an ID variable after fitting/estimating, for plotting.
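A minimal sketch of that pattern (placeholder names, not code from the post): since collect_predictions() after last_fit() returns one row per test-set observation in order, an ID column kept out of the model with update_role() can be attached back and used to filter misclassified rows. Here wf, data_split, and the truth column outcome are all assumptions.

```r
library(tidymodels)

# `wf` is a workflow whose recipe included update_role(name, new_role = "id");
# `data_split` is an rsample split object
final_fit <- last_fit(wf, data_split)

misclassified <- final_fit %>%
  collect_predictions() %>%                            # one row per test observation, in order
  bind_cols(testing(data_split) %>% select(name)) %>%  # attach the ID column from the test set
  filter(.pred_class != outcome)                       # keep only wrongly classified rows
```

The `outcome` name above stands in for whatever truth column your collected predictions contain.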

pgomba commented 2 years ago

Thanks! Looks promising.

Basically I'm trying to associate the tibble obtained via final_fit %>% collect_predictions() with the ID column from the testing data. To do so, I've been using cbind to merge the ID column from the testing data with the results of final_fit %>% collect_predictions(). Somehow it is working as intended! I'll check soon whether the update_role step keeps the ID associated with the sample after last_fit() has been used.

Thanks!

Krusenstierna commented 2 years ago

Wonderful video and blog! Is there a way to force one variable into a tidymodels recipe, i.e. in recipe(average ~ ., data = game_train), for models that penalize variables (such as the lasso)? Let's say that, from prior knowledge, you knew that "minage" is important for the question at hand, and you want that variable to always be included in addition to all the other variables the lasso keeps after the penalty, even though "minage" would actually be removed if not forced into the model.

Thanks for your time!

juliasilge commented 2 years ago

@Krusenstierna You should be able to do that with glmnet by passing penalty.factor as an engine argument when creating your model specification.
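For example, a hedged sketch (the column names here are assumptions about the prepped design matrix, not taken from the post): glmnet's penalty.factor takes one value per predictor, and a 0 exempts that predictor from the penalty, so the lasso can never drop it.

```r
library(tidymodels)

# assumed columns of the design matrix after prepping the recipe, in order
predictor_cols <- c("minage", "minplayers", "maxplayers", "playingtime")
pens <- ifelse(predictor_cols == "minage", 0, 1)  # 0 = never penalize minage

lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet", penalty.factor = pens)
```

Note that the order of penalty.factor must match the columns of the final design matrix glmnet sees, i.e. after any dummy/indicator steps.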

chadallison commented 2 years ago

Hi Julia,

I am following along with your recent board game ratings video and am running into an issue with the tune_race_anova() function. When running it, I receive the following error messages.

So from here, I tried switching over to the tune_grid() function instead to try and troubleshoot why these errors might be occurring. After doing this, I get the same "all models failed" warning, so I accessed the .notes column to try and troubleshoot further. Upon doing this, I am receiving the following message for each of the 20 iterations.

I tried commenting out the str_replace_all() part of the split_category() function, but the same issue then arose with str_to_lower(), where I am told the object is not found. Is this an error you have run into before, and if so, do you know how I might go about understanding what is going wrong behind the scenes?

juliasilge commented 2 years ago

Ah @chadallison it seems like you haven't loaded the stringr package possibly? Or are you using PSOCK clusters for parallel processing? The issue is that the stringr package isn't loaded where the tuning is happening. Can you try loading the stringr package like pkgs = c("stringr", "any_other_package_you_need") in control_grid()?
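A sketch of that fix, following this post's object names (xgb_wf and game_folds are assumptions about your objects): passing pkgs in the control object tells tune to load stringr on each parallel worker.

```r
library(tidymodels)

xgb_res <- tune_grid(
  xgb_wf,                                      # workflow whose recipe calls stringr functions
  resamples = game_folds,
  grid = 20,
  control = control_grid(pkgs = c("stringr"))  # load stringr on the parallel workers
)
```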

mounta74 commented 2 years ago

When running the tune_race_anova() function, I receive the following: "Warning: All models failed. See the .notes column." and "Error: There were no valid metrics for the ANOVA model." Could you help out?

juliasilge commented 2 years ago

@mounta74 If you ever run into trouble with a racing function like this, I recommend trying to just plain fit the workflow on your training data, or use tune_grid(). You will likely get a better understanding of where the problems are.
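A sketch of that debugging step (the parameter names and values are hypothetical, and xgb_wf/game_train are assumed object names): a workflow with tune() placeholders has to be finalized before it can be fit once, and a plain fit surfaces errors directly instead of hiding them in .notes.

```r
library(tidymodels)

# substitute fixed values for the tune() placeholders
fixed_wf <- finalize_workflow(
  xgb_wf,
  tibble(mtry = 3, min_n = 10, learn_rate = 0.01)
)

# a single plain fit on the training data; any recipe or engine error
# will print immediately rather than being collected per resample
single_fit <- fit(fixed_wf, data = game_train)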

Krusenstierna commented 2 years ago

For me, the "Error: There were no valid metrics for the ANOVA model." is isolated to the use of XGBoost. Racing with ANOVA works fine with lasso, random forest, SVM, neural network models, and more, but not with XGBoost. XGBoost works fine with tune_grid() (regular grids and space-filling variants such as Latin hypercube approaches), but when I replace the same code with the ANOVA race it always generates "Error: There were no valid metrics for the ANOVA model."

juliasilge commented 2 years ago

@Krusenstierna Can you create a reprex (a minimal reproducible example) demonstrating your problem, and then post it on RStudio Community? The goal of a reprex is to make it easier for folks to recreate your problem so that we can understand it and/or fix it.

If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:

install.packages("reprex")

Thanks! 🙌

aakbarie commented 2 years ago

Hi Julia, thank you so much for all the blogs. Could you make a blog post on using SHAP with tidymodels as an explainer for each individual outcome, like JJ did with a tensorflow/keras model and lime about four years ago using the churn dataset?

juliasilge commented 2 years ago

@aakbarie You may be interested in the explainability chapter of our tidymodels book.

aakbarie commented 2 years ago

Perfect, thank you

mounta74 commented 2 years ago

I used tune_grid() instead with pkgs = "stringr" in control_grid(), and it worked fine with almost the same result you found. What matters most for me in this case is being able to deploy the SHAPforxgboost package for model interpretability; this package is more useful than the vip package. Thank you.

juliasilge commented 2 years ago

@mounta74 You can also pass a control object to a racing function, with pkgs and other tuning controls FYI.

IForberg commented 2 years ago

I got the same error as several others have noted here, even after copying and pasting the code and using the latest packages. When I skipped running "doParallel::registerDoParallel()" it took forever, but it worked. Weird.

juliasilge commented 2 years ago

@IForberg and others who are having trouble with the parallel processing, I suspect that you may be on Windows machines where setting up parallel workers is different than on Linux/macOS. The big issue with this blog post is that there is a non-tidymodels package (stringr) that you need to pass to the parallel workers. In Linux/macOS this happens without us having to do anything, but with PSOCK clusters, you need to pass the extra package in as an argument to the tuning control. If you are using tune_grid(), that means control_grid(); if you are using a racing function, you need to pass in a control_race() object.
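For example (object names assumed, as in this post), the racing version of passing the extra package looks like:

```r
library(tidymodels)
library(finetune)

xgb_race <- tune_race_anova(
  xgb_wf,
  resamples = game_folds,
  grid = 20,
  control = control_race(pkgs = c("stringr"))  # racing analogue of control_grid(pkgs = ...)
)
```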

ghost commented 2 years ago

Hello, please, I am trying to implement XGBoost on a different project, but when I get to

game_shap <- shap.prep(
  xgb_model = extract_fit_engine(xgb_fit),
  X_train = bake(game_prep, has_role("predictor"),
                 new_data = NULL, composition = "matrix")
)

I keep getting this error: "Error in convert_matrix(): 5 columns are not numeric; cannot convert to matrix. Run rlang::last_error() to see where the error occurred."

juliasilge commented 2 years ago

@DetaDao Hmmmm, it seems like your recipe results in columns that are not numeric, which I think wouldn't work with xgboost to start with. It's hard to tell with so little information, though. If you can create a reprex demonstrating your problem, I would recommend posting to RStudio Community.

conlelevn commented 1 year ago

@juliasilge I have the same error as the people above. I tried to create a reprex and post it on RStudio Community, but it said "Sorry, new users can only put one embedded media item in a post." :( even though I only pasted the reprex code in the comment box? What can I do now?

conlelevn commented 1 year ago

@juliasilge sorry, I have figured it out by removing the graph from the reprex :) thanks

izzydi commented 1 year ago

Hi Julia, thank you very much for your work! I'm learning a lot from you! I have one question regarding the last_fit() function. I have a dataset which I split into train and test sets like this:

seed = 1821
set.seed(seed)

df_split <- initial_split(
  df,
  prop = .75,
  strata = Target)

train_set <- training(df_split)
test_set <- testing(df_split)

I use the train set for all my analysis and keep the test set just for model evaluation. My analysis showed me that I have some outliers, so I decided to impute them using the dlookr package.

train_set <- train_set %>%
  mutate(time_in_hospital = imputate_outlier(train_set, time_in_hospital, method = "mean")) %>%
  mutate(n_lab_procedures = imputate_outlier(train_set, n_lab_procedures, method = "mean")) %>%
  mutate(n_procedures = imputate_outlier(train_set, n_procedures, method = "capping")) %>%
  mutate(n_medications = imputate_outlier(train_set, n_medications, method = "mode")) %>%
  mutate(n_outpatient = imputate_outlier(train_set, n_outpatient, method = "capping"))

I made a random forest model with a recipe

model

rf_spec <- rand_forest(
  trees = 1000,
  mtry = 11,
  min_n = 11) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("classification")

workflow

rf_wf <- workflow() %>%
  add_recipe(new_recipe) %>%
  add_model(rf_spec)

And now I want to fit the model using the last_fit() function like this:

rf_model <- last_fit(rf_wf, df_split)

The problem is that I think last_fit() will take the train set from the first split I made and not the one in which I made the imputations. I know how to do it without using last_fit(), but my question is whether there is a way to update the training set for last_fit() to use. I hope I was clear. Thank you again for your great help! Have a nice day!

izzydi commented 1 year ago

Sorry for the second comment; I forgot to include the recipe I used.

basic_recipe <- recipe(Target ~ .,data = train_set) # the train_set here is after the imputation

new_recipe <- basic_recipe %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

Now I'm thinking: because of the workflow for the random forest (rf_wf), and because I added the "new_recipe", will last_fit() use the train set I used to make the recipe? If yes, then it's all good.

juliasilge commented 1 year ago

@izzydi I may not be entirely understanding what you're asking, but I think you have two main choices. You can impute your outliers using imputate_outlier() before splitting into training and testing. This introduces the possibility of data leakage, because you won't have any ability to understand how this step is impacting your model performance. In some cases it may turn out fine, but it's not good statistical practice. Alternatively, you can wrap up the imputate_outlier() function in a recipe step, like this.

If you want some more guidance on this, I recommend that you create a reprex (a minimal reproducible example) showing more concretely what you mean. The goal of a reprex is to make it easier for us to recreate your problem/question so that we can understand it and/or fix it. If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. Thanks! 🙌

izzydi commented 1 year ago
# Libraries
suppressPackageStartupMessages({library(tidyverse)
                                library(tidymodels)
                                library(recipes)
                                library(dlookr)
                                library(janitor)
                                library(cowplot)
                                library(vip)  
                                library(reprex)})
#> Warning: package 'tidyr' was built under R version 4.0.5
#> Warning: package 'workflowsets' was built under R version 4.0.5
#> Warning: package 'cowplot' was built under R version 4.0.5
#> Warning: package 'vip' was built under R version 4.0.5

Created on 2023-02-11 with reprex v2.0.2

readmissions <- read_csv('hospital_readmissions.csv',
                         show_col_types = FALSE,
                         na = c("Missing")) 
#> Error in read_csv("hospital_readmissions.csv", show_col_types = FALSE, : could not find function "read_csv"

df <- readmissions %>%
 select(-c("medical_specialty")) %>% # remove medical specialty
 rename(Target = readmitted) %>% # rename readmitted
 mutate_if(is.character, factor)  # strings to factors
#> Error in readmissions %>% select(-c("medical_specialty")) %>% rename(Target = readmitted) %>% : could not find function "%>%"

seed = 1821
set.seed(seed)

df_split <- 
  initial_split(
  df,
  prop = .75, # we keep 75% for the train set and 25% for the test set
  strata = Target)
#> Error in initial_split(df, prop = 0.75, strata = Target): could not find function "initial_split"

train_set <- training(df_split)
#> Error in training(df_split): could not find function "training"
test_set <- testing(df_split)
#> Error in testing(df_split): could not find function "testing"

basic_recipe <- recipe(Target ~ .,data = train_set)
#> Error in recipe(Target ~ ., data = train_set): could not find function "recipe"
rcp <- basic_recipe %>%
 step_impute_mode(all_nominal_predictors()) # impute mode
#> Error in basic_recipe %>% step_impute_mode(all_nominal_predictors()): could not find function "%>%"

train_set <-
 rcp %>%
 prep() %>%
 bake(new_data = NULL)
#> Error in rcp %>% prep() %>% bake(new_data = NULL): could not find function "%>%"

train_set<- train_set %>%
  mutate(time_in_hospital = imputate_outlier(train_set, time_in_hospital, method = "mean")) %>%
  mutate(n_lab_procedures = imputate_outlier(train_set, n_lab_procedures, method = "mean")) %>%
  mutate(n_procedures = imputate_outlier(train_set, n_procedures, method = "capping")) %>%
  mutate(n_medications = imputate_outlier(train_set, n_medications, method = "mode")) %>%
  mutate(n_outpatient = imputate_outlier(train_set, n_outpatient, method = "capping"))
#> Error in train_set %>% mutate(time_in_hospital = imputate_outlier(train_set, : could not find function "%>%"

basic_recipe <- recipe(Target ~ .,data = train_set) # train_set after imputation of ouliers
#> Error in recipe(Target ~ ., data = train_set): could not find function "recipe"

new_recipe <- 
 basic_recipe %>%
 step_impute_mode(all_nominal_predictors()) %>%
 step_zv(all_predictors()) %>%
 step_normalize(all_numeric_predictors()) %>%
 step_dummy(all_nominal_predictors()) 
#> Error in basic_recipe %>% step_impute_mode(all_nominal_predictors()) %>% : could not find function "%>%"

# Random forest model
rf_spec <- rand_forest(
trees = 1000,
mtry = 3,
min_n = 5) %>%
set_engine("ranger",importance = "permutation") %>%
set_mode("classification")
#> Error in rand_forest(trees = 1000, mtry = 3, min_n = 5) %>% set_engine("ranger", : could not find function "%>%"

# workflow
rf_wf <-
workflow() %>%
add_recipe(new_recipe) %>%
add_model(rf_spec)
#> Error in workflow() %>% add_recipe(new_recipe) %>% add_model(rf_spec): could not find function "%>%"

# predictions
rf_model <- last_fit(rf_wf,df_split)
#> Error in last_fit(rf_wf, df_split): could not find function "last_fit"

# metrics
rf_model %>%
 collect_metrics()
#> Error in rf_model %>% collect_metrics(): could not find function "%>%"

Created on 2023-02-11 with reprex v2.0.2

izzydi commented 1 year ago

Hi Julia! I think I have done it. I used reprex now. My question is: which set is the last_fit() function using? Is it using the train set before the imputation of outliers or after? And just to be sure, collect_metrics() gives us the performance of the model on the test set, right? Thank you so much!

Update: I think I have the answer. last_fit() is using the train set provided in the workflow via the add_recipe(new_recipe) step, so it is the train set after the imputation of the outliers. Thanks!!

juliasilge commented 1 year ago

@izzydi Hmmm, I am seeing a lot of errors in the reprex and I am unable to run it. You may want to check out some tips for creating a reprex that can help people understand your question/problem. Once you have a reprex, it might be better to post on RStudio Community than blog comments; it's a great forum for getting help with these kinds of modeling questions.

JorgeRuRe commented 1 year ago

Hi Julia! Thank you for the lesson. Can the top 10 variables be plotted for the first Shapley values plot? I have a df with more than 300 covariates, and plotting all of them is inconvenient.

juliasilge commented 1 year ago

@JorgeRuRe I think you can use the top_n argument for shap.prep(): https://liuyanguu.github.io/SHAPforxgboost/reference/shap.prep.html
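A sketch following this post's object names (xgb_fit and game_prep are assumptions about your fitted objects), with top_n added per the SHAPforxgboost reference linked above:

```r
library(tidymodels)
library(SHAPforxgboost)

shap_long <- shap.prep(
  xgb_model = extract_fit_engine(xgb_fit),
  X_train = bake(game_prep, has_role("predictor"),
                 new_data = NULL, composition = "matrix"),
  top_n = 10                     # keep only the 10 highest-importance features
)

shap.plot.summary(shap_long)
```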

JorgeRuRe commented 1 year ago

> @JorgeRuRe I think you can use the top_n argument for shap.prep(): https://liuyanguu.github.io/SHAPforxgboost/reference/shap.prep.html

Thank you, Julia!

jlecornu3 commented 1 year ago

Hi Julia - nice video and post, especially the interaction plots at the end for non-linear effects.

I'm struggling to integrate an xgboost model trained in tidymodels with the accumulated local effects model interpretation method found in ALEPlot - any experience or ideas on this?

juliasilge commented 1 year ago

@jlecornu3 Hmmm, no, I haven't. I recommend that you create a reprex (a minimal reproducible example) outlining the trouble you are having and then posting on RStudio Community. It's a great forum for getting help with these kinds of modeling questions, and maybe someone else there has tried it. Good luck! 🙌