Open abichat opened 2 years ago
Hello @abichat 👋
You are correct that something is slightly off there. I have confirmed that the model that is fit inside step_pls()
is identical to the model being fit by plsda()
. Something is happening (it might just be some bad rounding) with how we are transforming the information that is causing this.
I don't think that we are comparing the same things. step_pls()
is trying to return the regression coefficients relating the predictors to the outcome(s), as opposed to quantities like the loadings.
I believe that mix_plsda$variates$X
are the rotation for $X$. But PLS does an indirect regression of $X$ on $Y$. To get that, we need to use the $R$ matrix, which is $R = W(P'W)^{-1}$ (a clear reference for this is here). This $R$ is $p \times m$ where $p$ is the number of predictors and $m$ is the number of components.
Those calculations are done in the butcher_pls
function. We try not to save the entire mixOmics object in the recipe.
When we have new data $X^$ (say $n \times p$), bake()
produces $X^R$ which is $n \times m$.
(edit for clarity)
Thanks for your answer and sorry for the delay.
I understand your point of returning only the preprocessing/projection output of step_pls()
and not the entire mixOmics object. However, as tidymodels
is also use for dimension reduction alone, returning information about how well it has worked and how we can interpret it is useful (loadings or proportion of the variance explained #1038) . Maybe it could be considered as an optional argument, like importance
for rand_forest()
?
Concerning mix_plsda$variates$X
, I choose it because it's the one returned by plotLoadings
and it's called importance in the documentation. Another one could be more appropriate for variable contribution to the axis, my linear algebra is a bit patchy.
The problem
I tried to compare results of a PLS dimension reduction with
recipes::step_pls()
andmixOmics::plsda()
(at the begining, to recreate a importance plot likemixOmics::plotLoadings()
in a tidy way), but I encountered this issue. Although coordinates in the embed space are the same (machine epsilon difference), variables loading -also called importance- differ for components 2 and beyond (see reprex).I'm surprised because
step_pls()
is supposed to be a wrapper aroundplsda()
. I recognize that the difference is not much foriris
, but is significant. Do you have any idea where it comes from? I saw that recipes quantities are recomputed, so maybe it's another close quantity that is computed?Thanks for your your time and all your work on
tidymodels
!Reproducible example
Created on 2022-09-14 with reprex v2.0.2
Session info
``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.1.3 (2022-03-10) #> os Ubuntu 18.04.6 LTS #> system x86_64, linux-gnu #> ui X11 #> language en_US:en #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Paris #> date 2022-09-14 #> pandoc 2.18 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.1.3) #> BiocParallel 1.28.3 2021-12-09 [3] Bioconductor #> class 7.3-20 2022-01-16 [3] CRAN (R 4.1.3) #> cli 3.3.0 2022-04-25 [2] CRAN (R 4.1.2) #> codetools 0.2-18 2020-11-04 [3] CRAN (R 4.1.3) #> colorspace 2.0-3 2022-02-21 [2] CRAN (R 4.1.3) #> corpcor 1.6.10 2021-09-16 [2] CRAN (R 4.1.2) #> curl 4.3.2 2021-06-23 [3] CRAN (R 4.1.3) #> DBI 1.1.3 2022-06-18 [2] CRAN (R 4.1.3) #> digest 0.6.29 2021-12-01 [2] CRAN (R 4.1.3) #> dplyr * 1.0.10 2022-09-01 [2] CRAN (R 4.1.3) #> ellipse 0.4.3 2022-05-31 [3] CRAN (R 4.1.3) #> ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.1.3) #> evaluate 0.16 2022-08-09 [2] CRAN (R 4.1.3) #> fansi 1.0.3 2022-03-24 [2] CRAN (R 4.1.2) #> farver 2.1.1 2022-07-06 [1] CRAN (R 4.1.3) #> fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.1.3) #> fs 1.5.2 2021-12-08 [2] CRAN (R 4.1.3) #> future 1.27.0 2022-07-22 [3] CRAN (R 4.1.3) #> future.apply 1.9.0 2022-04-25 [3] CRAN (R 4.1.3) #> generics 0.1.3 2022-07-05 [2] CRAN (R 4.1.3) #> ggplot2 * 3.3.6 2022-05-03 [2] CRAN (R 4.1.2) #> ggrepel 0.9.1 2021-01-15 [3] CRAN (R 4.1.3) #> globals 0.15.1 2022-06-24 [3] CRAN (R 4.1.3) #> glue 1.6.2 2022-02-24 [2] CRAN (R 4.1.3) #> gower 1.0.0 2022-02-03 [3] CRAN (R 4.1.3) #> gridExtra 2.3 2017-09-09 [3] CRAN (R 4.1.3) #> gtable 0.3.1 2022-09-01 [2] CRAN (R 4.1.3) #> hardhat 1.2.0 2022-06-30 [1] CRAN (R 4.1.3) #> highr 0.9 2021-04-16 [2] CRAN (R 4.1.3) #> htmltools 0.5.3 2022-07-18 [2] CRAN (R 4.1.3) #> httr 1.4.4 2022-08-17 [2] CRAN (R 4.1.3) #> igraph 1.3.4 2022-07-19 [2] CRAN (R 4.1.3) #> ipred 0.9-13 2022-06-02 [3] CRAN (R 4.1.3) #> knitr 1.40 2022-08-24 [1] CRAN (R 4.1.3) #> labeling 0.4.2 2020-10-20 [2] CRAN (R 4.1.3) #> lattice * 0.20-45 2021-09-22 [3] CRAN (R 4.1.3) #> lava 1.6.10 2021-09-02 [3] CRAN (R 4.1.3) #> lifecycle 1.0.1 2021-09-24 [2] CRAN (R 4.1.3) #> listenv 0.8.0 2019-12-05 [3] CRAN (R 4.1.3) #> lubridate 1.8.0 2021-10-07 [2] CRAN (R 4.1.3) #> magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.1.2) #> MASS * 7.3-55 2022-01-16 [3] CRAN (R 4.1.3) #> Matrix 1.4-0 2021-12-08 [3] CRAN (R 4.1.3) #> matrixStats 0.62.0 2022-04-19 [3] CRAN (R 4.1.3) #> mime 0.12 2021-09-28 [2] CRAN (R 4.1.3) #> mixOmics * 6.18.1 2022-04-05 [2] bioc_git2r (@5ef4960) #> munsell 0.5.0 2018-06-12 [2] CRAN (R 4.1.3) #> nnet 7.3-17 2022-01-16 [3] CRAN (R 4.1.3) #> parallelly 1.32.1 2022-07-21 [3] CRAN (R 4.1.3) #> pillar 1.8.1 2022-08-19 [2] CRAN (R 4.1.3) #> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.1.3) #> plyr 1.8.7 2022-03-24 [2] CRAN (R 4.1.2) #> prodlim 2019.11.13 2019-11-17 [3] CRAN (R 4.1.3) #> purrr 0.3.4 2020-04-17 [2] CRAN (R 4.1.3) #> R6 2.5.1 2021-08-19 [2] CRAN (R 4.1.3) #> rARPACK 0.11-0 2016-03-10 [2] CRAN (R 4.1.2) #> RColorBrewer 1.1-3 2022-04-03 [2] CRAN (R 4.1.2) #> Rcpp 1.0.9 2022-07-08 [2] CRAN (R 4.1.3) #> recipes * 1.0.1 2022-07-07 [1] CRAN (R 4.1.3) #> reprex 2.0.2 2022-08-17 [2] CRAN (R 4.1.3) #> reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.1.3) #> rlang 1.0.4 2022-07-12 [1] CRAN (R 4.1.3) #> rmarkdown 2.16 2022-08-24 [1] CRAN (R 4.1.3) #> rpart 4.1.16 2022-01-24 [3] CRAN (R 4.1.3) #> RSpectra 0.16-1 2022-04-24 [1] CRAN (R 4.1.3) #> rstudioapi 0.14 2022-08-22 [2] CRAN (R 4.1.3) #> scales 1.2.1 2022-08-20 [2] CRAN (R 4.1.3) #> sessioninfo 1.2.2 2021-12-06 [3] CRAN (R 4.1.3) #> stringi 1.7.8 2022-07-11 [2] CRAN (R 4.1.3) #> stringr 1.4.1 2022-08-20 [2] CRAN (R 4.1.3) #> survival 3.2-13 2021-08-24 [3] CRAN (R 4.1.3) #> tibble 3.1.8 2022-07-22 [2] CRAN (R 4.1.3) #> tidyr * 1.2.0 2022-02-01 [2] CRAN (R 4.1.3) #> tidyselect 1.1.2 2022-02-21 [2] CRAN (R 4.1.3) #> timeDate 4021.104 2022-07-19 [3] CRAN (R 4.1.3) #> utf8 1.2.2 2021-07-24 [2] CRAN (R 4.1.3) #> vctrs 0.4.1 2022-04-13 [2] CRAN (R 4.1.2) #> withr 2.5.0 2022-03-03 [2] CRAN (R 4.1.3) #> xfun 0.32 2022-08-10 [1] CRAN (R 4.1.3) #> xml2 1.3.3 2021-11-30 [2] CRAN (R 4.1.3) #> yaml 2.3.5 2022-02-21 [2] CRAN (R 4.1.3) #> #> [1] /home/ATBT_CB/R/IMG_R_4.1/4.1.3 #> [2] /home/ATBT_CB/R/x86_64-pc-linux-gnu-library/4.1 #> [3] /opt/R/4.1.3/lib/R/library #> #> ────────────────────────────────────────────────────────────────────────────── ```