rgcca-factory / RGCCA

https://rgcca-factory.github.io/RGCCA/
Speed stopping criterion computation by directly using unlist #36

GFabien closed 2 years ago

GFabien commented 2 years ago

The current way of comparing a and a_old in rgccak and sgccak can be very slow when blocks have many variables. This PR provides a new implementation that scales better with the number of variables. The old and new implementations were compared on 3 configurations:

### 1st config
# Used blocks
library(RGCCA)
library(gliomaData)
data(ge_cgh_locIGR)
# The same GE block is used twice, giving two large blocks
blocks <- list(GE1 = ge_cgh_locIGR$multiblocks$GE, GE2 = ge_cgh_locIGR$multiblocks$GE)
connection <- 1 - diag(2)
# Measured call
rgcca(blocks = blocks, connection = connection,
      sparsity = c(.071, .071), ncomp = 2,
      scheme = "centroid", verbose = FALSE, method = "sgcca")

### 2nd config
# Used blocks
# Reuses ge_cgh_locIGR loaded in the 1st config
blocks <- ge_cgh_locIGR$multiblocks
connection <- matrix(c(0, 0, 1, 0, 0, 1, 1, 1, 0), 3, 3)
# Measured call
rgcca(blocks = blocks, connection = connection,
      sparsity = c(.071, .2, 1), ncomp = c(2, 2, 1),
      scheme = "centroid", verbose = FALSE, method = "sgcca")

### 3rd config
# Used blocks
data(Russett)
blocks <- list(agriculture = Russett[, seq(3)],
               industry = Russett[, 4:5],
               politic = Russett[, 6:11])
# Measured call
rgcca(blocks = blocks, method = "rgcca", connection = 1 - diag(3),
      scheme = "factorial", tau = rep(1, 3))
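
The description does not show the changed code itself, so the following is only a minimal sketch of the idea named in the title, assuming the stopping criterion is the squared Euclidean distance between the current and previous weight vectors. The lists `a` and `a_old` hold made-up values, and `criterion_unlist` and `criterion_blockwise` are hypothetical helper names, not functions from RGCCA.

```r
# Hypothetical weight vectors for two blocks (values and sizes are illustrative).
set.seed(1)
a_old <- list(rnorm(5), rnorm(3))
a     <- lapply(a_old, function(v) v + 0.01)

# Flatten each list once, then take the squared norm of the difference.
# use.names = FALSE skips name handling, which is not needed for this comparison.
criterion_unlist <- function(a, a_old) {
  diff <- unlist(a, use.names = FALSE) - unlist(a_old, use.names = FALSE)
  drop(crossprod(diff))
}

# Cross-check only (not the old RGCCA implementation):
# sum the squared differences block by block.
criterion_blockwise <- function(a, a_old) {
  sum(mapply(function(x, y) sum((x - y)^2), a, a_old))
}

criterion_unlist(a, a_old)
criterion_blockwise(a, a_old)
```

Both helpers should return the same value up to floating-point error; the `unlist` version does so with a single vectorised subtraction over all blocks.
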

Using microbenchmark, we get the following results:

1st config:
Unit: seconds
 expr      min    mean   median      max neval
  new 2.238619  2.4353 2.367956 2.753959    10
  old 30.00891 30.3169 30.26584 30.94634    10

2nd config:
Unit: seconds
 expr      min     mean   median      max neval
  new 1.189144 1.279134 1.278469 1.379503    10
  old 1.173425 1.243162 1.236418 1.324785    10

3rd config:
Unit: milliseconds
 expr      min     mean   median      max neval
  new 4.193016 4.991649 4.459408 88.17322 10000
  old 4.307242 5.196242 4.741204 74.25639 10000

For small to moderately sized vectors, both implementations perform similarly (configs 2 and 3). For large vectors, the new implementation is much faster than the old one: in config 1 the median run time drops from 30.27 s to 2.37 s, roughly a 13-fold speedup.
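
The effect of vector size can be explored on synthetic data. The sketch below is not the RGCCA code: it contrasts a single `unlist` call with a deliberately naive element-by-element concatenation, chosen here only as an example of a flattening pattern whose cost grows quickly with length. The names `flat_once` and `grow_by_one`, the block size `p`, and the data are all assumptions made for illustration.

```r
# Synthetic stand-in for two large blocks of weights (size chosen arbitrarily).
set.seed(42)
p <- 5000
a <- list(rnorm(p), rnorm(p))

# Flatten the list in one call.
flat_once <- function(x) unlist(x, use.names = FALSE)

# Illustrative slow baseline: grow the result one element at a time,
# which copies the whole vector at every step.
grow_by_one <- function(x) {
  out <- numeric(0)
  for (block in x) for (value in block) out <- c(out, value)
  out
}

# Time both approaches; fall back to system.time if microbenchmark is not installed.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    flat_once = flat_once(a),
    grow_by_one = grow_by_one(a),
    times = 10
  ))
} else {
  print(system.time(flat_once(a)))
  print(system.time(grow_by_one(a)))
}
```

Increasing `p` should widen the gap between the two timings, while for small `p` the difference is expected to be negligible, which is consistent with the pattern reported across the three configurations.
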