juliasilge / juliasilge.com

My blog, built with blogdown and Hugo
https://juliasilge.com/

PCA and UMAP with tidymodels and #TidyTuesday cocktail recipes | Julia Silge #40

Open utterances-bot opened 2 years ago

utterances-bot commented 2 years ago

PCA and UMAP with tidymodels and #TidyTuesday cocktail recipes | Julia Silge

Use tidymodels for unsupervised dimensionality reduction.

https://juliasilge.com/blog/cocktail-recipes-umap/

factorialmap commented 2 years ago

Thank you so much, Julia. I think this video and content are great as an intuitive explanation of PCA and of how to implement and visualize it well in RStudio.

portolan75 commented 2 years ago

Hi Julia, this tidy workflow is very interesting and I am using it more and more. I also tried the UMAP workflow, but how do I predict UMAP coordinates for a new set of data? In your example, if I bake umap_prep on a different dataset (with the same variables), it does not work, and neither does the standard 'predict' function. Am I doing something wrong, or is it not possible to predict/bake on a new set?

juliasilge commented 2 years ago

@portolan75 Is it this problem that you are seeing? Or something else?

If it is something else, then I suggest that you create a reprex (a minimal reproducible example) for the problem you are observing, and post it on RStudio Community. The goal of a reprex is to make it easier for us to recreate your problem so that others can understand it.

If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:

install.packages("reprex")

Thanks! 🙌

portolan75 commented 2 years ago

Hi @juliasilge, thanks for your answer. After your comment I tried again and realised that I had done something wrong with my dataset, which is why I was not able to 'predict' (bake) on the test set. I was getting good results on the training set but was not able to bake the UMAP coordinates for the test set. Anyway, it works now. Thanks for the attention and also for redirecting me to the other 'CppMethod' problem, which turned out to be useful as well.
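
For anyone landing here with the same question, here is a minimal sketch of baking a prepped UMAP recipe on rows it has not seen. It assumes the embed package (which provides step_umap()) and its uwot dependency are installed; USArrests and the train/test split are stand-ins chosen for illustration, not the cocktail data from the post.

```r
library(recipes)
library(embed)  # assumed installed; provides step_umap()

set.seed(123)

# Hypothetical split: hold out 10 rows to play the role of "new data"
test_rows <- sample(nrow(USArrests), 10)
arrests_train <- USArrests[-test_rows, ]
arrests_test  <- USArrests[test_rows, ]

umap_rec <- recipe(~ ., data = arrests_train) %>%
  step_normalize(all_numeric()) %>%
  step_umap(all_numeric(), num_comp = 2)

# prep() estimates the normalization and the UMAP embedding on the training rows
umap_prep <- prep(umap_rec)

# new_data = NULL returns the processed training set
train_coords <- bake(umap_prep, new_data = NULL)

# Passing a data frame with the same columns projects the new rows
test_coords <- bake(umap_prep, new_data = arrests_test)
```

The new data needs the same column names and types as the data given to recipe(); a mismatch there is a common reason for bake() to fail.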

Averysaurus commented 2 years ago

All this work is so brilliant, @juliasilge. Is there any literature (book chapters, articles, videos) on PCA interpretation that you can recommend?

juliasilge commented 2 years ago

Kasramhdz commented 2 years ago

Thank you for the fantastic tutorial. I have a question: how can we change the rotation method applied in step_pca()?

juliasilge commented 2 years ago

@Kasramhdz The step_pca() function uses stats::prcomp() under the hood, which I don't believe supports that, but you can get out the loadings using tidy() and the type = "coef" argument and then apply a rotation yourself. See this Cross Validated answer for more info.
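
As a rough sketch of that suggestion, the loadings can be pulled out with tidy(), reshaped into a matrix, and passed to stats::varimax(). Varimax is only one possible rotation, picked here for illustration; USArrests, the "pca" step id, and the choice to rotate the first two components are likewise assumptions for the example.

```r
library(recipes)
library(tidyr)  # for pivot_wider()

rec <- recipe(~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = 2, id = "pca")

# Extract the loadings and keep only the components we want to rotate
loadings_wide <- prep(rec) %>%
  tidy(id = "pca", type = "coef") %>%
  filter(component %in% c("PC1", "PC2")) %>%
  pivot_wider(names_from = component, values_from = value, id_cols = terms)

# Build a variables-by-components matrix for the rotation
loading_mat <- as.matrix(loadings_wide[, c("PC1", "PC2")])
rownames(loading_mat) <- loadings_wide$terms

# stats::varimax() returns the rotated loadings and the rotation matrix
rotated <- varimax(loading_mat)
rotated$loadings
rotated$rotmat
```

The rotated loadings live outside the recipe, so bake() still returns the unrotated components; to get rotated scores, multiply the baked PC columns by rotated$rotmat yourself.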

Kasramhdz commented 2 years ago

I have another question. I'm new to tidymodels, but apparently step_pca() arguments such as num_comp or threshold are not being applied when the recipe is trained. In the example below, I'm still getting 4 components despite setting num_comp = 2.

library(recipes)
library(tidyr)  # for pivot_wider()

rec <- recipe( ~ ., data = USArrests) %>%
    step_normalize(all_numeric()) %>%
    step_pca(all_numeric(), num_comp = 2)

prep(rec) %>%
    tidy(number = 2, type = "coef") %>%
    pivot_wider(names_from = component, values_from = value, id_cols = terms)

juliasilge commented 2 years ago

@Kasramhdz The full PCA is determined (so you can still compute the variances of each term) and num_comp specifies how many of the components are retained as predictors. If you want to specify the maximal rank, you can pass that through options:

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
rec <- recipe( ~ ., data = USArrests) %>%
    step_normalize(all_numeric()) %>%
    step_pca(all_numeric(), num_comp = 2, options = list(rank. = 2))

prep(rec) %>% tidy(number = 2, type = "coef")
#> # A tibble: 8 × 4
#>   terms     value component id       
#>   <chr>     <dbl> <chr>     <chr>    
#> 1 Murder   -0.536 PC1       pca_T11OM
#> 2 Assault  -0.583 PC1       pca_T11OM
#> 3 UrbanPop -0.278 PC1       pca_T11OM
#> 4 Rape     -0.543 PC1       pca_T11OM
#> 5 Murder    0.418 PC2       pca_T11OM
#> 6 Assault   0.188 PC2       pca_T11OM
#> 7 UrbanPop -0.873 PC2       pca_T11OM
#> 8 Rape     -0.167 PC2       pca_T11OM

Created on 2022-01-12 by the reprex package (v2.0.1)

You could also control this via the tol argument.
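
A minimal sketch of the tol route, assuming that options is forwarded to stats::prcomp() in the same way as rank. above. In prcomp(), components whose standard deviation is at most tol times that of the first component are dropped; the value 0.5 and the "pca" step id are illustrative choices, and for the normalized USArrests data this cutoff should leave two components.

```r
library(recipes)

rec_tol <- recipe(~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  # tol = 0.5 is an illustrative cutoff, passed through to prcomp()
  step_pca(all_numeric(), num_comp = 2, options = list(tol = 0.5), id = "pca")

pca_prep <- prep(rec_tol)

# Components that survived the tolerance cutoff
coefs <- tidy(pca_prep, id = "pca", type = "coef")
unique(coefs$component)

# num_comp still decides how many components come back as columns
baked <- bake(pca_prep, new_data = NULL)
names(baked)
```

Unlike rank., the number of components kept by tol depends on the data, so it is worth checking the tidy() output before relying on a fixed num_comp.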