rgcca-factory / RGCCA

https://rgcca-factory.github.io/RGCCA/

Different result with previous versions #31

Closed llrs closed 2 years ago

llrs commented 2 years ago

On previous versions of the RGCCA package this test passed:

test_that("soft.threshold works", {
  x <- c(-0.228021798300704, 1.307526215182, -1.33214728362879, 0.0607459286046758,
         -1.7414344632077, 1.25994077812035, 0.245983479419414, 0.165889497257229,
         0.674060397637496, 0.659960576376325)
  out <- soft.threshold(x, 0.5)
  expect_equal(out, c(0, 0, 0, 0, 0, 0, 0, 1.34653177497057e-10, 0, 0))
})

However, with the current version on the CRAN branch (and on master), the result is c(0, 0, 0, 0, -0.5, 0, 0, 0, 0, 0). I saw that there is now a proj_l1_l2 function called inside the soft.threshold function, so some work went into that function, and the internal BinarySearch function is no longer used in the package on the CRAN branch.
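For context, the element-wise soft-thresholding step that both implementations build on can be sketched as follows (a minimal re-implementation for illustration only; the package's internal soft function may differ, and the part that changed between versions is the search for the threshold d):

```r
# Minimal sketch of element-wise soft thresholding (illustration, not RGCCA's code):
# each entry is shrunk toward zero by d, and entries with |x| <= d become exactly 0.
soft <- function(x, d) sign(x) * pmax(abs(x) - d, 0)

soft(c(-2, 0.5, 1.5), 1)
#> [1] -1.0  0.0  0.5
```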

In addition, the previous version worked without warnings when using soft.threshold(x, 0.1), while the CRAN branch generates a warning:

Warning message:
In sol[which(abs(argu) == MAX)] <- sign(argu) * a/nMAX :
  number of items to replace is not a multiple of replacement length

This is one of the pieces of the CRAN implementation I didn't understand. Is this change intentional? Was the previous soft threshold wrong? Any comment on this will be greatly appreciated.

llrs commented 2 years ago

An unrelated change I found is that previously the returned Y values were the result of scaling and matrix multiplication by astar. Now the original data can be used without scaling (as was already documented). Many thanks for changing that!

AGloaguen commented 2 years ago

Dear Luis,

Thank you very much for your involvement in this package.

First, do you use the "CRAN" branch? Because the following warning:

Warning message: In sol[which(abs(argu) == MAX)] <- sign(argu) * a/nMAX : number of items to replace is not a multiple of replacement length

was indeed previously identified and is now, hopefully, taken care of, as you can see below:

[screenshot: the fixed code]

If you do use the "CRAN" branch, could you provide a reproducible example? Because when I try soft.threshold(x, 0.1) on the current CRAN branch version, there isn't any warning.

Furthermore, BinarySearch was indeed replaced by proj_l1_l2 (and should be removed from the package in the current version, thank you). proj_l1_l2 is faster and respects the constraints more closely. Concerning the first test you mentioned, I am confused by expect_equal(out, c(0, 0, 0, 0, 0, 0, 0, 1.34653177497057e-10, 0, 0)): the L1-norm of out should be equal to 0.5, yet here you test it against a vector whose L1-norm equals 1.34653177497057e-10.
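This mismatch is easy to check directly on the vector the old test expects:

```r
# L1-norm of the vector expected by the old test: essentially zero, not 0.5
expected <- c(0, 0, 0, 0, 0, 0, 0, 1.34653177497057e-10, 0, 0)
sum(abs(expected))
#> [1] 1.346532e-10
```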

Best,

Arnaud

AGloaguen commented 2 years ago

Here is the paper where we compared proj_l1_l2 and BinarySearch: https://hal.archives-ouvertes.fr/hal-01630744/document

llrs commented 2 years ago

Dear Arnaud,

Sorry, I confused the branches I had loaded; the warning is only present on the master branch and, as you said, it is fixed by that check.

About the test with test_that, I copy-pasted this test from the checks I had and ran it with the CRAN branch. I am using 1.34653177497057e-10 because this is what the CRAN version (not the CRAN branch) returns. I confirmed it with a fresh install of the package from CRAN on R 4.0.3 and R 4.1.1. The reason I raised this issue is that it affects the results of sgcca, and I wanted to be sure whether this was a desired change or a bug fix.

If it is a bug that will be fixed in the next release, I might need to switch to this development version of the package for all my work, and I might check previous publications that used the RGCCA package.

The new soft.threshold is 8 times faster than the previous version! Many thanks for this improvement!

Best,

Lluís

AGloaguen commented 2 years ago

Dear Lluis,

I am happy that you appreciate the speed improvement of soft.threshold.

The quick answer is: no, this bug is not going to be fixed, because it is not a bug.

Indeed, the development branch "CRAN" is the closest branch to what is going to be released, and for now the proj_l1_l2 part is normally going to stay as it is. This is not a bug, because the output that you are evaluating in testthat (c(0, 0, 0, 0, 0, 0, 0, 1.34653177497057e-10, 0, 0)) was previously produced by BinarySearch and is not a valid output. This is explained by the following example:

library(RGCCA)
#> Loading required package: MASS
x <- c(-0.228021798300704, 1.307526215182, -1.33214728362879, 0.0607459286046758,
       -1.7414344632077, 1.25994077812035, 0.245983479419414, 0.165889497257229,
       0.674060397637496, 0.659960576376325)

out_Binary     <- RGCCA:::soft(x = x,
                               d = RGCCA:::BinarySearch(argu = x, sumabs = 0.5))
out_proj_l1_l2 <- RGCCA:::soft.threshold(x, 0.5)

#Check norm2
norm(out_Binary, type = "2")
#> [1] 1.013649e-10
norm(out_proj_l1_l2, type = "2")
#> [1] 0.5

#Check norm1
sum(abs(out_Binary))
#> [1] 1.013649e-10
sum(abs(out_proj_l1_l2))
#> [1] 0.5

#Check scalar product
crossprod(x, out_Binary)
#>              [,1]
#> [1,] 1.765204e-10
crossprod(x, out_proj_l1_l2)
#>           [,1]
#> [1,] 0.8707172

Created on 2021-10-01 by the reprex package (v2.0.0)

Indeed, both solutions above (the one proposed by BinarySearch and the one by proj_l1_l2) do respect the constraints: their L2-norms are below or equal to 1 and their L1-norms are below or equal to 0.5. However, they should respect these constraints while maximizing the scalar product with the original vector x. Here, we can see that proj_l1_l2 proposes a far better solution than BinarySearch in terms of scalar product.

However, the good news for you is that, even though the output of this test changed, it should not affect your results much. This test was changed in the new version because we realized it was not a good one: the sparsity parameter in the test you mentioned is fixed to 0.5, but for BinarySearch to work it should be in [1; sqrt(p)]. To be clear, for the sparsity parameter in sgcca we ask for a value in [1/sqrt(p); 1]; however, by the time it is given to soft.threshold, this parameter has been re-scaled by the function into [1; sqrt(p)]. So the test was not a good one, and we now test with 1.5 instead of 0.5.
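A minimal sketch of that re-scaling, assuming it is a plain multiplication by sqrt(p) (the variable names here are hypothetical, not the package's actual code):

```r
# Hypothetical illustration of the re-scaling described above:
# the user-facing sparsity parameter in sgcca lives in [1/sqrt(p), 1],
# while the constraint handed to the threshold search lives in [1, sqrt(p)].
p        <- 10                  # number of variables in the block
sparsity <- 0.5                 # user-facing value, valid since 0.5 >= 1/sqrt(10)
const    <- sparsity * sqrt(p)  # value actually passed to the threshold search
const
#> [1] 1.581139
```

With p = 10 the old test value 0.5 falls outside [1, sqrt(10)], which is why it was not a valid input for BinarySearch, whereas the new test value 1.5 is inside that interval.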

So in the end, yes I advise you to use the development branch "CRAN" and my guess is your results would not change too much.

I hope this was clear. Please, do not hesitate if you have more questions.

Best,

Arnaud

llrs commented 2 years ago

Many thanks for the clear and detailed explanation. I understand that the currently released version of RGCCA, which I've been using, didn't respect the last requirement, maximizing the scalar product with the original vector. In a few tests I found that the new version reduces the number of iterations needed to converge, so it makes the sgcca function even faster.
As you said, this only changed the results slightly, on the order of 0.000109 for the outer weights of the first block on the example's data. All these reasons convinced me to use the CRAN branch of the package in my analysis.

A minor difference I found while comparing results is that the class of the object returned by rgcca(..., mode = "sgcca") is rgcca, while the class of the object returned by the old sgcca function was sgcca.

AGloaguen commented 2 years ago

Happy that you were convinced by this new version!

About your last point, you are right. We will discuss whether to keep the output class as it is now.

Thank you again for your interest in the package.

Best,

Arnaud