tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org
Other
564 stars 111 forks source link

PLS loadings from step_pls() differ from plsda() ones #1036

Open abichat opened 2 years ago

abichat commented 2 years ago

The problem

I tried to compare results of a PLS dimension reduction with recipes::step_pls() and mixOmics::plsda() (at the begining, to recreate a importance plot like mixOmics::plotLoadings() in a tidy way), but I encountered this issue. Although coordinates in the embed space are the same (machine epsilon difference), variables loading -also called importance- differ for components 2 and beyond (see reprex).

I'm surprised because step_pls() is supposed to be a wrapper around plsda(). I recognize that the difference is not much for iris, but is significant. Do you have any idea where it comes from? I saw that recipes quantities are recomputed, so maybe it's another close quantity that is computed?

Thanks for your your time and all your work on tidymodels!

Reproducible example

library(dplyr)
library(tidyr)
library(recipes)
library(mixOmics)

## Computation

mix_plsda <- plsda(X = iris[-5], Y = iris$Species, ncomp = 3)
tidy_plsda <-
  recipe(Species ~ ., data = iris) %>% 
  step_pls(all_numeric_predictors(), outcome = "Species", num_comp = 3) %>% 
  prep()

## Projection

mix_proj <- as_tibble(mix_plsda$variates$X)
tidy_proj <- bake(tidy_plsda, new_data = NULL)

max(abs(tidy_proj[, -1] - mix_proj))
#> [1] 1.998401e-15

## Loadings

mix_load <- 
  as_tibble(mix_plsda$loadings$X, rownames = "terms") %>% 
  pivot_longer(-terms, names_to = "component", 
               names_prefix = "comp", values_to = "value")
tidy_loadings <- tidy(tidy_plsda, 1)

mix_load 
#> # A tibble: 12 × 3
#>    terms        component    value
#>    <chr>        <chr>        <dbl>
#>  1 Sepal.Length 1         -0.471  
#>  2 Sepal.Length 2         -0.274  
#>  3 Sepal.Length 3         -0.763  
#>  4 Sepal.Width  1          0.315  
#>  5 Sepal.Width  2         -0.930  
#>  6 Sepal.Width  3          0.0558 
#>  7 Petal.Length 1         -0.586  
#>  8 Petal.Length 2         -0.0377 
#>  9 Petal.Length 3          0.00780
#> 10 Petal.Width  1         -0.579  
#> 11 Petal.Width  2         -0.244  
#> 12 Petal.Width  3          0.644
tidy_loadings
#> # A tibble: 12 × 4
#>    terms          value component id       
#>    <chr>          <dbl> <chr>     <chr>    
#>  1 Sepal.Length -0.471  PLS1      pls_aj8g7
#>  2 Sepal.Length -0.258  PLS2      pls_aj8g7
#>  3 Sepal.Length -0.701  PLS3      pls_aj8g7
#>  4 Sepal.Width   0.315  PLS1      pls_aj8g7
#>  5 Sepal.Width  -0.940  PLS2      pls_aj8g7
#>  6 Sepal.Width   0.201  PLS3      pls_aj8g7
#>  7 Petal.Length -0.586  PLS1      pls_aj8g7
#>  8 Petal.Length -0.0177 PLS2      pls_aj8g7
#>  9 Petal.Length  0.0338 PLS3      pls_aj8g7
#> 10 Petal.Width  -0.579  PLS1      pls_aj8g7
#> 11 Petal.Width  -0.224  PLS2      pls_aj8g7
#> 12 Petal.Width   0.704  PLS3      pls_aj8g7

mix_load %>% 
  transmute(terms, component = paste0("PLS", component),
            value_mix = value) %>% 
  left_join(dplyr::select(tidy_loadings, terms, component, value_tidy = value), 
            by = c("terms", "component")) %>% 
  ggplot() +
  aes(x = value_mix, y = value_tidy, color = component) +
  geom_abline(linetype = "dashed", color = "grey80") +
  geom_point(size = 3) +
  labs(x = "Loadings in mixOmics",
       y = "Loadings in tidymodels",
       color = NULL)

Created on 2022-09-14 with reprex v2.0.2

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.1.3 (2022-03-10) #> os Ubuntu 18.04.6 LTS #> system x86_64, linux-gnu #> ui X11 #> language en_US:en #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Paris #> date 2022-09-14 #> pandoc 2.18 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.1.3) #> BiocParallel 1.28.3 2021-12-09 [3] Bioconductor #> class 7.3-20 2022-01-16 [3] CRAN (R 4.1.3) #> cli 3.3.0 2022-04-25 [2] CRAN (R 4.1.2) #> codetools 0.2-18 2020-11-04 [3] CRAN (R 4.1.3) #> colorspace 2.0-3 2022-02-21 [2] CRAN (R 4.1.3) #> corpcor 1.6.10 2021-09-16 [2] CRAN (R 4.1.2) #> curl 4.3.2 2021-06-23 [3] CRAN (R 4.1.3) #> DBI 1.1.3 2022-06-18 [2] CRAN (R 4.1.3) #> digest 0.6.29 2021-12-01 [2] CRAN (R 4.1.3) #> dplyr * 1.0.10 2022-09-01 [2] CRAN (R 4.1.3) #> ellipse 0.4.3 2022-05-31 [3] CRAN (R 4.1.3) #> ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.1.3) #> evaluate 0.16 2022-08-09 [2] CRAN (R 4.1.3) #> fansi 1.0.3 2022-03-24 [2] CRAN (R 4.1.2) #> farver 2.1.1 2022-07-06 [1] CRAN (R 4.1.3) #> fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.1.3) #> fs 1.5.2 2021-12-08 [2] CRAN (R 4.1.3) #> future 1.27.0 2022-07-22 [3] CRAN (R 4.1.3) #> future.apply 1.9.0 2022-04-25 [3] CRAN (R 4.1.3) #> generics 0.1.3 2022-07-05 [2] CRAN (R 4.1.3) #> ggplot2 * 3.3.6 2022-05-03 [2] CRAN (R 4.1.2) #> ggrepel 0.9.1 2021-01-15 [3] CRAN (R 4.1.3) #> globals 0.15.1 2022-06-24 [3] CRAN (R 4.1.3) #> glue 1.6.2 2022-02-24 [2] CRAN (R 4.1.3) #> gower 1.0.0 2022-02-03 [3] CRAN (R 4.1.3) #> gridExtra 2.3 2017-09-09 [3] CRAN (R 4.1.3) #> gtable 0.3.1 2022-09-01 [2] CRAN (R 4.1.3) #> hardhat 1.2.0 2022-06-30 [1] CRAN (R 4.1.3) #> highr 0.9 2021-04-16 [2] CRAN (R 4.1.3) #> htmltools 0.5.3 2022-07-18 [2] CRAN (R 4.1.3) #> httr 1.4.4 2022-08-17 [2] CRAN (R 4.1.3) #> igraph 1.3.4 2022-07-19 [2] CRAN (R 4.1.3) #> ipred 0.9-13 2022-06-02 [3] CRAN (R 4.1.3) #> knitr 1.40 2022-08-24 [1] CRAN (R 4.1.3) #> labeling 0.4.2 2020-10-20 [2] CRAN (R 4.1.3) #> lattice * 0.20-45 2021-09-22 [3] CRAN (R 4.1.3) #> lava 1.6.10 2021-09-02 [3] CRAN (R 4.1.3) #> lifecycle 1.0.1 2021-09-24 [2] CRAN (R 4.1.3) #> listenv 0.8.0 2019-12-05 [3] CRAN (R 4.1.3) #> lubridate 1.8.0 2021-10-07 [2] CRAN (R 4.1.3) #> magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.1.2) #> MASS * 7.3-55 2022-01-16 [3] CRAN (R 4.1.3) #> Matrix 1.4-0 2021-12-08 [3] CRAN (R 4.1.3) #> matrixStats 0.62.0 2022-04-19 [3] CRAN (R 4.1.3) #> mime 0.12 2021-09-28 [2] CRAN (R 4.1.3) #> mixOmics * 6.18.1 2022-04-05 [2] bioc_git2r (@5ef4960) #> munsell 0.5.0 2018-06-12 [2] CRAN (R 4.1.3) #> nnet 7.3-17 2022-01-16 [3] CRAN (R 4.1.3) #> parallelly 1.32.1 2022-07-21 [3] CRAN (R 4.1.3) #> pillar 1.8.1 2022-08-19 [2] CRAN (R 4.1.3) #> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.1.3) #> plyr 1.8.7 2022-03-24 [2] CRAN (R 4.1.2) #> prodlim 2019.11.13 2019-11-17 [3] CRAN (R 4.1.3) #> purrr 0.3.4 2020-04-17 [2] CRAN (R 4.1.3) #> R6 2.5.1 2021-08-19 [2] CRAN (R 4.1.3) #> rARPACK 0.11-0 2016-03-10 [2] CRAN (R 4.1.2) #> RColorBrewer 1.1-3 2022-04-03 [2] CRAN (R 4.1.2) #> Rcpp 1.0.9 2022-07-08 [2] CRAN (R 4.1.3) #> recipes * 1.0.1 2022-07-07 [1] CRAN (R 4.1.3) #> reprex 2.0.2 2022-08-17 [2] CRAN (R 4.1.3) #> reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.1.3) #> rlang 1.0.4 2022-07-12 [1] CRAN (R 4.1.3) #> rmarkdown 2.16 2022-08-24 [1] CRAN (R 4.1.3) #> rpart 4.1.16 2022-01-24 [3] CRAN (R 4.1.3) #> RSpectra 0.16-1 2022-04-24 [1] CRAN (R 4.1.3) #> rstudioapi 0.14 2022-08-22 [2] CRAN (R 4.1.3) #> scales 1.2.1 2022-08-20 [2] CRAN (R 4.1.3) #> sessioninfo 1.2.2 2021-12-06 [3] CRAN (R 4.1.3) #> stringi 1.7.8 2022-07-11 [2] CRAN (R 4.1.3) #> stringr 1.4.1 2022-08-20 [2] CRAN (R 4.1.3) #> survival 3.2-13 2021-08-24 [3] CRAN (R 4.1.3) #> tibble 3.1.8 2022-07-22 [2] CRAN (R 4.1.3) #> tidyr * 1.2.0 2022-02-01 [2] CRAN (R 4.1.3) #> tidyselect 1.1.2 2022-02-21 [2] CRAN (R 4.1.3) #> timeDate 4021.104 2022-07-19 [3] CRAN (R 4.1.3) #> utf8 1.2.2 2021-07-24 [2] CRAN (R 4.1.3) #> vctrs 0.4.1 2022-04-13 [2] CRAN (R 4.1.2) #> withr 2.5.0 2022-03-03 [2] CRAN (R 4.1.3) #> xfun 0.32 2022-08-10 [1] CRAN (R 4.1.3) #> xml2 1.3.3 2021-11-30 [2] CRAN (R 4.1.3) #> yaml 2.3.5 2022-02-21 [2] CRAN (R 4.1.3) #> #> [1] /home/ATBT_CB/R/IMG_R_4.1/4.1.3 #> [2] /home/ATBT_CB/R/x86_64-pc-linux-gnu-library/4.1 #> [3] /opt/R/4.1.3/lib/R/library #> #> ────────────────────────────────────────────────────────────────────────────── ```
EmilHvitfeldt commented 2 years ago

Hello @abichat 👋

You are correct that something is slightly off there. I have confirmed that the model that is fit inside step_pls() is identical to the model being fit by plsda(). Something is happening (it might just be some bad rounding) with how we are transforming the information that is causing this.

topepo commented 2 years ago

I don't think that we are comparing the same things. step_pls() is trying to return the regression coefficients relating the predictors to the outcome(s), as opposed to quantities like the loadings.

I believe that mix_plsda$variates$X are the rotation for $X$. But PLS does an indirect regression of $X$ on $Y$. To get that, we need to use the $R$ matrix, which is $R = W(P'W)^{-1}$ (a clear reference for this is here). This $R$ is $p \times m$ where $p$ is the number of predictors and $m$ is the number of components.

Those calculations are done in the butcher_pls function. We try not to save the entire mixOmics object in the recipe.

When we have new data $X^$ (say $n \times p$), bake() produces $X^R$ which is $n \times m$.

(edit for clarity)

abichat commented 1 year ago

Thanks for your answer and sorry for the delay.

I understand your point of returning only the preprocessing/projection output of step_pls() and not the entire mixOmics object. However, as tidymodels is also use for dimension reduction alone, returning information about how well it has worked and how we can interpret it is useful (loadings or proportion of the variance explained #1038) . Maybe it could be considered as an optional argument, like importance for rand_forest()?

Concerning mix_plsda$variates$X, I choose it because it's the one returned by plotLoadings and it's called importance in the documentation. Another one could be more appropriate for variable contribution to the axis, my linear algebra is a bit patchy.