juliasilge / juliasilge.com

My blog, built with blogdown and Hugo :link:
https://juliasilge.com/

PCA and UMAP with tidymodels and #TidyTuesday cocktail recipes #49

Open · tanthiamhuat opened this issue 2 years ago

tanthiamhuat commented 2 years ago

Hello Julia Silge,

Thank you for your notes on PCA with tidymodels. I have a few questions:

a) First, you write: "we must tell the recipe() what's going on with our model (notice the formula with no outcome) and what data we are using," i.e. `pca_rec <- recipe(~ ., data = cocktails_df)`. If I use a recipe with no outcome like that, tuning fails with the errors below:

```r
pca_rec <- recipe(~ ., data = data_train) %>%
  step_YeoJohnson(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric_predictors(), num_comp = 5)

rf_model <- rand_forest() %>%
  set_args(mtry = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_wflow <- workflow() %>%
  add_recipe(pca_rec) %>%
  add_model(rf_model)

rf_grid <- expand.grid(mtry = c(3, 4, 5))

rf_tune_results <- rf_wflow %>%
  tune_grid(
    resamples = data_cv,
    grid = rf_grid,
    metrics = metric_set(accuracy, roc_auc)
  )
#> Fold1: preprocessor 1/1, model 1/3: Error in y.mat[, 2]: subscript out of bounds
#> Fold1: preprocessor 1/1, model 2/3: Error in y.mat[, 2]: subscript out of bounds
#> Fold1: preprocessor 1/1, model 3/3: Error in y.mat[, 2]: subscript out of bounds
#> (the same error repeats for models 1/3 through 3/3 on Fold2 through Fold5)
#> Warning message: All models failed. See the .notes column.
```

b) Secondly, I am still trying to understand and interpret the PCA plots from your post (the images I attached did not come through). Would you mind explaining a bit more?

c) Third, I often encounter errors on metric sets like this:

```r
rf_tune_results <- rf_wflow %>%
  tune_grid(
    resamples = data_cv,
    grid = rf_grid,
    metrics = metric_set(recall, precision, accuracy, roc_auc)
  )
#> Error: The combination of metric functions must be:
#> ...
#> The following metric function types are being mixed:
#> ...
```

juliasilge commented 2 years ago

On 1) if you are going to tune your number of components to optimize some outcome, then you do indeed need to say what that outcome is, like recipe(my_outcome_here ~ ., data = data_train). The cocktails data doesn't have an outcome, because it is unsupervised.
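A minimal sketch of what a supervised version of that recipe looks like. The `cells` data from the modeldata package is used here purely as a stand-in for `data_train` (it is not from this thread); the key difference is the outcome on the left-hand side of the formula:

```r
library(tidymodels)

# Stand-in data: `cells` has a factor outcome `class`
# plus many numeric predictors (your own data_train would go here)
data(cells, package = "modeldata")
cells <- dplyr::select(cells, -case)

# Declare the outcome on the left of ~ so tune_grid() has something to optimize
pca_rec <- recipe(class ~ ., data = cells) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 5)

# prep() + bake() to check the result: the outcome column plus PC1..PC5
baked <- bake(prep(pca_rec), new_data = NULL)
names(baked)
```

With an outcome declared, the same workflow and `tune_grid()` call from the question should run instead of failing with `subscript out of bounds`.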

On 2) I'll point you to my rstudio::conf talk from a few years ago.

On 3) you have caret loaded, so the functions from caret are masking the ones from yardstick/tidymodels. I'd suggest not loading caret, using the namespace explicitly like yardstick::precision, calling tidymodels_prefer(), or setting up your own use of the conflicted package.
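A quick sketch of those options side by side (`my_metrics` is an illustrative name; in practice you would pick just one approach):

```r
library(tidymodels)

# Option 1: fully qualify the yardstick functions inside metric_set()
my_metrics <- yardstick::metric_set(
  yardstick::recall, yardstick::precision,
  yardstick::accuracy, yardstick::roc_auc
)

# Option 2: declare that tidymodels functions win any name clashes
tidymodels_prefer()

# Option 3: resolve specific clashes explicitly with conflicted
library(conflicted)
conflict_prefer("precision", "yardstick")
conflict_prefer("recall", "yardstick")
```

Any of these keeps caret's `precision()`/`recall()` from shadowing yardstick's versions, which is what produces the "metric function types are being mixed" error.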

tanthiamhuat commented 2 years ago

Thanks Julia Silge. I always look forward to your answers and I learn a lot from you and your blog. Yes, those three questions are clearly explained and answered, and I am much clearer on PCA interpretation now. Indeed, your PCA example at https://juliasilge.com/blog/cocktail-recipes-umap/ explains it well on a second read.

I am also following another example, https://www.kirenz.com/post/2021-02-17-r-classification-tidymodels/, where I simply cannot reproduce the values the author presents. I contacted him, but he says there is no issue on his side. I do not know why my "kap" and "spec" are so low compared to his, even though we are using the same data and code. Because of that, my ROC curve also looks odd and different from his. Would you mind taking a quick look? Thanks.

```r
log_res %>% collect_metrics(summarize = TRUE)
#>   .metric   .estimator     mean     n  std_err .config
#> 1 accuracy  binary       0.797      5 0.000913 Preprocessor1_Model1
#> 2 f_meas    binary       0.887      5 0.000579 Preprocessor1_Model1
#> 3 kap       binary      -0.0118     5 0.00206  Preprocessor1_Model1
#> 4 precision binary       0.807      5 0.000177 Preprocessor1_Model1
#> 5 recall    binary       0.984      5 0.00134  Preprocessor1_Model1
#> 6 roc_auc   binary       0.723      5 0.00316  Preprocessor1_Model1
#> 7 sens      binary       0.984      5 0.00134  Preprocessor1_Model1
#> 8 spec      binary       0.00841    5 0.00141  Preprocessor1_Model1
```

Note the near-zero **kap** and **spec** values, which are what differ from the author's results.
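As an editorial aside: a metric pattern like the one above (high accuracy and sensitivity, near-zero kap and spec) is typical of a model that predicts the majority class almost every time. A small base-R sketch with hypothetical counts (not the data from the linked post) shows how that pattern arises:

```r
# Hypothetical imbalanced data: 800 "Yes", 200 "No";
# the model predicts "Yes" almost every time.
truth <- factor(c(rep("Yes", 800), rep("No", 200)), levels = c("Yes", "No"))
pred  <- factor(c(rep("Yes", 790), rep("No", 10),    # 790/800 positives caught
                  rep("Yes", 195), rep("No", 5)),    # only 5/200 negatives caught
                levels = c("Yes", "No"))

tab <- table(pred, truth)

accuracy <- sum(diag(tab)) / sum(tab)          # 0.795: looks decent
sens <- tab["Yes", "Yes"] / sum(tab[, "Yes"])  # 0.9875: very high
spec <- tab["No", "No"]   / sum(tab[, "No"])   # 0.025: almost never predicts "No"

# Cohen's kappa: observed agreement vs. agreement expected by chance
p_o <- accuracy
p_e <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
kappa <- (p_o - p_e) / (1 - p_e)               # ~0.019: barely better than chance
```

A kappa near zero despite ~0.8 accuracy means the model is doing little more than guessing the majority class, so the first thing to check against the author's setup is the factor level order of the outcome (which class yardstick treats as the "event") and any class-balancing step in the recipe.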