rgcca-factory / RGCCA

https://rgcca-factory.github.io/RGCCA/
10 stars 11 forks source link

a and astar being equal #40

Closed llrs closed 2 years ago

llrs commented 2 years ago

Perhaps I'm not understanding correctly how RGCCA works but shouldn't the weights and a* (or astar on the output list) be different?

I found that currently they are the same for both rgcca and sgcca methods:

  library("RGCCA")
#> Loading required package: MASS

# Create the dataset
data(Russett)
blocks = list(agriculture = Russett[, seq(3)],
              industry = Russett[, 4:5],
              politic = Russett[, 6:11])

# Blocks are fully connected, factorial scheme and tau =1 for all blocks is
# used by default
fit.rgcca = rgcca(blocks=blocks, method = "rgcca", connection = 1-diag(3),
                  scheme = "factorial", tau = rep(1, 3))
summary(fit.rgcca$a$agriculture[, 1] == fit.rgcca$astar$agriculture[, 1])
#>    Mode    TRUE 
#> logical       3

Created on 2021-11-15 by the reprex package (v2.0.1)

Session info ``` r sessioninfo::session_info() #> ─ Session info ────────────────────────────────────────────────────────────── #> hash: elf: medium skin tone, person walking: medium skin tone, call me hand: medium-light skin tone #> #> setting value #> version R version 3.6.3 (2020-02-29) #> os Ubuntu 20.04.3 LTS #> system x86_64, linux-gnu #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Madrid #> date 2021-11-15 #> pandoc 2.14.2 @ /usr/lib/rstudio/bin/quarto/bin/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> ! package * version date (UTC) lib source #> 2 assertthat 0.2.1 2019-03-21 [?] CRAN (R 3.6.0) #> 2 backports 1.2.1 2020-12-09 [?] CRAN (R 3.6.3) #> 2 cli 3.1.0 2021-10-27 [?] CRAN (R 3.6.3) #> 2 colorspace 2.0-2 2021-06-24 [?] CRAN (R 3.6.3) #> 2 crayon 1.4.1 2021-02-08 [?] CRAN (R 3.6.3) #> 1 data.table 1.14.2 2021-09-27 [1] CRAN (R 3.6.3) #> 1 DBI 1.1.1 2021-01-15 [1] CRAN (R 3.6.3) #> 2 Deriv 4.1.3 2021-02-24 [?] CRAN (R 3.6.3) #> 1 digest 0.6.28 2021-09-23 [1] CRAN (R 3.6.3) #> 2 dplyr 1.0.7 2021-06-18 [?] CRAN (R 3.6.3) #> 2 ellipsis 0.3.2 2021-04-29 [?] CRAN (R 3.6.3) #> 2 evaluate 0.14 2019-05-28 [?] CRAN (R 3.6.0) #> 2 fansi 0.5.0 2021-05-25 [?] CRAN (R 3.6.3) #> 1 fastmap 1.1.0 2021-01-25 [1] CRAN (R 3.6.3) #> 2 fs 1.5.0 2020-07-31 [?] CRAN (R 3.6.3) #> 2 generics 0.1.0 2020-10-31 [?] CRAN (R 3.6.3) #> 2 ggplot2 3.3.5 2021-06-25 [?] CRAN (R 3.6.3) #> 2 glue 1.4.2 2020-08-27 [?] CRAN (R 3.6.3) #> 2 gridExtra 2.3 2017-09-09 [?] CRAN (R 3.6.0) #> 2 gtable 0.3.0 2019-03-25 [?] CRAN (R 3.6.0) #> 2 highr 0.9 2021-04-16 [?] CRAN (R 3.6.3) #> 2 htmltools 0.5.2 2021-08-25 [?] CRAN (R 3.6.3) #> 2 htmlwidgets 1.5.4 2021-09-08 [?] CRAN (R 3.6.3) #> 2 httr 1.4.2 2020-07-20 [?] CRAN (R 3.6.3) #> 2 jsonlite 1.7.2 2020-12-09 [?] CRAN (R 3.6.3) #> 1 knitr 1.36 2021-09-29 [1] CRAN (R 3.6.3) #> 2 lazyeval 0.2.2 2019-03-15 [?] CRAN (R 3.6.0) #> 1 lifecycle 1.0.1 2021-09-24 [1] CRAN (R 3.6.3) #> 2 magrittr 2.0.1 2020-11-17 [?] CRAN (R 3.6.3) #> 2 MASS * 7.3-54 2021-05-03 [3] CRAN (R 3.6.3) #> 2 munsell 0.5.0 2018-06-12 [?] CRAN (R 3.6.0) #> 1 pillar 1.6.4 2021-10-18 [1] CRAN (R 3.6.3) #> 2 pkgconfig 2.0.3 2019-09-22 [?] CRAN (R 3.6.1) #> 2 plotly 4.10.0 2021-10-09 [?] CRAN (R 3.6.3) #> 2 purrr 0.3.4 2020-04-17 [?] CRAN (R 3.6.3) #> 2 R.cache 0.15.0 2021-04-30 [?] CRAN (R 3.6.3) #> 2 R.methodsS3 1.8.1 2020-08-26 [?] CRAN (R 3.6.3) #> 2 R.oo 1.24.0 2020-08-26 [?] CRAN (R 3.6.3) #> 1 R.utils 2.11.0 2021-09-26 [1] CRAN (R 3.6.3) #> 2 R6 2.5.1 2021-08-19 [?] CRAN (R 3.6.3) #> 2 reprex 2.0.1 2021-08-05 [?] CRAN (R 3.6.3) #> 2 RGCCA * 3.0 2021-10-18 [?] Github (rgcca-factory/RGCCA@29583ba) #> 1 rlang 0.4.12 2021-10-18 [1] CRAN (R 3.6.3) #> 2 rmarkdown 2.11 2021-09-14 [?] CRAN (R 3.6.3) #> 2 rstudioapi 0.13 2020-11-12 [?] CRAN (R 3.6.3) #> 2 scales 1.1.1 2020-05-11 [?] CRAN (R 3.6.3) #> 2 sessioninfo 1.2.1 2021-11-02 [?] CRAN (R 3.6.3) #> 1 stringi 1.7.5 2021-10-04 [1] CRAN (R 3.6.3) #> 2 stringr 1.4.0 2019-02-10 [?] CRAN (R 3.6.0) #> 1 styler 1.6.2 2021-09-23 [1] CRAN (R 3.6.3) #> 1 tibble 3.1.5 2021-09-30 [1] CRAN (R 3.6.3) #> 1 tidyr 1.1.4 2021-09-27 [1] CRAN (R 3.6.3) #> 2 tidyselect 1.1.1 2021-04-30 [?] CRAN (R 3.6.3) #> 2 utf8 1.2.2 2021-07-24 [?] CRAN (R 3.6.3) #> 2 vctrs 0.3.8 2021-04-29 [?] CRAN (R 3.6.3) #> 2 viridisLite 0.4.0 2021-04-13 [?] CRAN (R 3.6.3) #> 2 withr 2.4.2 2021-04-18 [?] CRAN (R 3.6.3) #> 1 xfun 0.27 2021-10-18 [1] CRAN (R 3.6.3) #> 2 yaml 2.2.1 2020-02-01 [?] CRAN (R 3.6.3) #> #> [1] /home/lrevilla/Documents/projects/howell_2018/renv/library/R-3.6/x86_64-pc-linux-gnu #> [2] /tmp/Rtmpu0Y8wL/renv-system-library #> [3] /usr/lib/R/library #> #> P ── Loaded and on-disk path mismatch. #> #> ────────────────────────────────────────────────────────────────────────────── ```
GFabien commented 2 years ago

Hi @llrs, this is totally normal in your example that a and astar are equal. They are always the same for the first canonical component (here ncomp = 1), they differ for the next components.

To understand the differences between a and astar, we need to have a look at how components are extracted. Let denote X one of the blocks we are analyzing, we are looking for a_1 that maximizes the S/RGCCA criterion, this yields y_1 = X a_1. a_1 corresponds to a[[i]][, 1] in R where i is the index of block X.

We want the next component to be extracted y_2 to be orthogonal (thus having no correlation) with y_1. In order to do so, we using a deflation strategy by rerunning RGCCA on X_1 instead of X with X_1 = X - y_1 (y_1^\top y_1)^{-1} y_1^\top X. Therefore, we find a_2, corresponding to a[[i]][, 2], such that y_2 = X_1 a_2. Here you see that a_2 no longer interacts with the original block, hence if you want to apply the projection defined by a_2 to another dataset, you need to express a_2 in the original space, that is what astar is for. In fact if you denote k the number of the component, and X_0 = X, you have X_{k - 1} a_k = X astar_k.

Hope this explanation is clear!

llrs commented 2 years ago

It is very clear, many thanks for the fast response. But shouldn't that also change due to using scaled data (now defaults scale = TRUE, scale_block = TRUE)? Multiplying the original (given as input) data by the astar values astar_1 on the first dimension doesn't return the final y_1 :

library("RGCCA")
#> Loading required package: MASS
#> Loading required package: MASS

# Create the dataset
data(Russett)
blocks = list(agriculture = Russett[, seq(3)],
              industry = Russett[, 4:5],
              politic = Russett[, 6:11])

# Blocks are fully connected, factorial scheme and tau =1 for all blocks is
# used by default
fit.rgcca = rgcca(blocks=blocks, method = "rgcca", connection = 1-diag(3),
                  scheme = "factorial", tau = rep(1, 3))
summary(fit.rgcca$a$agriculture[, 1] == fit.rgcca$astar$agriculture[, 1])
#>    Mode    TRUE 
#> logical       3
summary(fit.rgcca$Y$agriculture == as.matrix(blocks$agriculture) %*% fit.rgcca$a$agriculture)
#>    comp1        
#>  Mode :logical  
#>  FALSE:47
summary(fit.rgcca$Y$agriculture == as.matrix(blocks$agriculture) %*% fit.rgcca$astar$agriculture)
#>    comp1        
#>  Mode :logical  
#>  FALSE:47

Created on 2021-11-15 by the reprex package (v2.0.1)

GFabien commented 2 years ago

Indeed, we don't take into account the preprocessing of X when computing astar, so the equality is true only when multiplying the preprocessed block X with astar.

library("RGCCA")
#> Loading required package: MASS
#> Loading required package: MASS

# Create the dataset
data(Russett)
blocks = list(agriculture = Russett[, seq(3)],
              industry = Russett[, 4:5],
              politic = Russett[, 6:11])

# Blocks are fully connected, factorial scheme and tau =1 for all blocks is
# used by default
fit.rgcca = rgcca(blocks=blocks, method = "rgcca", connection = 1-diag(3),
                  scheme = "factorial", tau = rep(1, 3), ncomp = 2)
summary(fit.rgcca$a$agriculture[, 1] == fit.rgcca$astar$agriculture[, 1])
#>    Mode    TRUE 
#> logical       3
summary(fit.rgcca$a$agriculture[, 2] == fit.rgcca$astar$agriculture[, 2])
#>    Mode    FALSE 
#> logical       3
summary(abs(fit.rgcca$Y$agriculture - as.matrix(blocks$agriculture) %*% fit.rgcca$astar$agriculture) < 1e-14)
#> comp1           comp2        
#> Mode :logical   Mode :logical  
#> FALSE:47        FALSE:47  
summary(abs(fit.rgcca$Y$agriculture - as.matrix(fit.rgcca$call$blocks$agriculture) %*% fit.rgcca$astar$agriculture) < 1e-14)
#> comp1          comp2        
#> Mode:logical   Mode:logical  
#> TRUE:47        TRUE:47   
llrs commented 2 years ago

I was thinking to use the astar values as a measure of the importance of the variables on the data taking into account the specific values of the scaled data, but it seems I can not do that.

I would suggest to amend the documentation line "Each element of astar is a matrix defined as Y[[j]][, h] = A[[j]]%*%astar[[j]][, h]." that A[[j]] should be the scaled data obtained as you described.

GFabien commented 2 years ago

I would suggest to amend the documentation line "Each element of astar is a matrix defined as Y[[j]][, h] = A[[j]]%*%astar[[j]][, h]." that A[[j]] should be the scaled data obtained as you described.

Yes indeed it can be confusing.