mixOmicsTeam / mixOmics

Development repository for the Bioconductor package 'mixOmics'
http://mixomics.org/

Fix for Issue #148 (PR #220)

Closed Max-Bladen closed 2 years ago

Max-Bladen commented 2 years ago

Users requested some time ago a feature within our sparse methods that allows them to pass in a specific set of variables guaranteed to be included in the model. The relevant user-facing functions (spca, (mint).(block).spls(da)) now all take the retain.feats parameter to provide this functionality.

Check.retain.feats() handles the checking and pre-processing of the input retain.feats. This parameter influences the regularisation that occurs within soft_threshold_L1(), which is called by sparsity(). Along with the features flagged by select_feature, all features included in retain.feats do NOT have their loadings reduced to 0.
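As a rough base-R sketch of the mechanism described above (a simplified stand-in, NOT the actual soft_threshold_L1() internals; the function name and signature are invented for illustration):

```r
# Hypothetical, simplified soft-thresholding with a user-retained set of
# features exempted from being zeroed out. Illustrative only.
soft_threshold_retain <- function(a, keep, retain.idx) {
  abs.a <- abs(a)
  # features surviving sparsity: the `keep` largest |loadings| ...
  select_feature <- rank(-abs.a, ties.method = "first") <= keep
  # ... plus everything the user asked to retain
  select_feature[retain.idx] <- TRUE
  # the threshold is the largest |loading| among the dropped features
  threshold <- max(abs.a[!select_feature])
  out <- numeric(length(a))
  # surviving features are shrunk towards zero by the threshold; a retained
  # feature whose |loading| is below the threshold ends up NEGATIVE
  out[select_feature] <- sign(a[select_feature]) *
    (abs.a[select_feature] - threshold)
  out
}

soft_threshold_retain(c(10, 8, 6, 1, 2), keep = 2, retain.idx = 4)
# -> 4 2 0 -5 0   (the retained 4th feature, raw loading 1, ends up with a
#                  larger magnitude than the genuinely selected features)
```

This is exactly the inflation discussed in the comments below: the subtraction is harmless when every surviving feature exceeds the threshold, but retained features need not.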

Max-Bladen commented 2 years ago

Hi @aljabadi,

I need a bit of advice for this one. I've successfully implemented a way to retain user specified features in sparse functions via the new retain.feats parameter. It can take either indices or names of the desired features to retain and is implemented for spca, spls, splsda, mint.spls, mint.splsda, block.spls and block.splsda.

I'll refer to features selected by the sparse method (via keepX) as "selected" features. Those specified by the user (via retain.feats) will be referred to as "retained".

Here's the problem: the loading values of these retained features are severely overestimated. This is because both the selected and retained features have their loadings reduced by max(abs.a[!select_feature]) (see https://github.com/mixOmicsTeam/mixOmics/blob/0ff2d1671615bb32ac26f634b80833d4ab1bf0fb/R/internal_mint.block_helpers.R#L133). Using the below code as an example:

data(liver.toxicity)
X <- liver.toxicity$gene[, 1:200]
Y <- liver.toxicity$clinic

colnames(X)[6:10] <- c("A", "B", "C", "D", "E")
colnames(Y)[8:10] <- c("A", "B", "C")

retain.feats <- list(X=6:10, 
                     Y=8:10)

spls.obj <- spls(X, Y, keepX = c(6,1), keepY = c(6,1), retain.feats = retain.feats)

So, the features to retain are named A -> E so they can be easily identified. In a given iteration, the loading values of the selected and retained features are as follows:

           A            B            C            D            E 
    37.56952     13.64893     20.92075      3.42440     20.26891     

    A_42_P649672  A_43_P11568  A_43_P21626 A_42_P454114 A_42_P683537  A_43_P21372 
    63.32374      68.51839     66.69514    66.56536     67.45634      66.31351

Unsurprisingly, the selected features (starting with A_4*) have the maximal absolute loadings, while the retained features have a range of values. Each of these is then reduced by the maximal absolute loading of all non-selected features (max(abs.a[!select_feature]) = 62.85361). Now the values look like:

           A            B            C            D            E 
 -25.2840898  -49.2046839  -41.9328549  -59.4292085  -42.5846995    

    A_42_P649672  A_43_P11568  A_43_P21626  A_42_P454114 A_42_P683537  A_43_P21372 
    0.4701328     5.6647843    3.8415326    3.7117548    4.6027321     3.4599011
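The subtraction can be reproduced directly from the numbers printed above (base R only, values copied from the output shown):

```r
# Reproduce the post-subtraction loadings from the raw values above.
raw.retained <- c(A = 37.56952, B = 13.64893, C = 20.92075,
                  D = 3.42440, E = 20.26891)
raw.selected <- c(A_42_P649672 = 63.32374, A_43_P11568 = 68.51839,
                  A_43_P21626 = 66.69514, A_42_P454114 = 66.56536,
                  A_42_P683537 = 67.45634, A_43_P21372 = 66.31351)
threshold <- 62.85361  # max(abs.a[!select_feature])

round(raw.retained - threshold, 5)
#         A         B         C         D         E
# -25.28409 -49.20468 -41.93286 -59.42921 -42.58470
round(raw.selected - threshold, 5)
# A_42_P649672  A_43_P11568  A_43_P21626 A_42_P454114 A_42_P683537  A_43_P21372
#      0.47013      5.66478      3.84153      3.71175      4.60273      3.45990
```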

Now if we look at the final loading values, you can see the overestimation of the retained features (particularly in the X data frame).

plotLoadings(spls.obj)

https://i.imgur.com/jgbK6RO.png

Created on 2022-06-22 by the reprex package (v2.0.1)

Mathematically, these values are correct, but the subtraction results in their inflation. I played around with removing the - max(abs.a[!select_feature]) term, but this obviously resulted in different values for other components of the method (e.g. weights and AVE).

My primary question is: seeing as the loading values are scaled down via the L2 norm (sqrt(crossprod(loadings)) - see https://github.com/mixOmicsTeam/mixOmics/blob/0ff2d1671615bb32ac26f634b80833d4ab1bf0fb/R/internal_mint.block.R#L648), is the - max(abs.a[!select_feature]) subtraction necessary? If not, then removing this subtraction resolves the loading inflation issue, like so:

https://i.imgur.com/25JZHvo.png

Created on 2022-06-22 by the reprex package (v2.0.1)

Seeing as the loading values (as well as other components, e.g. AVE) change to differing degrees when we remove this subtraction, I'll assume that it is necessary. Hence, where should I go from here? Is there a better way to implement the retain.feats functionality without this issue? Or should I just forget about this feature?
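A small base-R check of why the subtraction cannot simply be dropped: soft-thresholding followed by L2 normalisation is not the same as L2 normalisation alone, because subtracting a constant from the magnitudes changes the ratios between loadings (numbers here are hypothetical):

```r
a <- c(5, 3, 1)
threshold <- 2

with.sub <- sign(a) * pmax(abs(a) - threshold, 0)  # soft threshold, then...
with.sub <- with.sub / sqrt(sum(with.sub^2))       # ...L2-normalise
no.sub <- a / sqrt(sum(a^2))                       # L2-normalise only

# The ratio between the first two loadings differs, so the subtraction is
# not absorbed by the normalisation: downstream quantities built from the
# loadings (weights, AVE) necessarily change when it is removed.
with.sub[1] / with.sub[2]  # 3
no.sub[1] / no.sub[2]      # 1.666667
```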

Cheers

mixOmicsTeam commented 2 years ago

My two cents is that it is not mathematically sound for the lasso to have this option (although one could consider it a constrained lasso of some sort). That said, we had implemented it earlier in the bootPLS package for a slightly different purpose (where we then realised there was some overfitting). What I would advise users to do instead is to run a sparse method, retrieve their selected variables, then run a classic non-sparse model with the selected features plus their 'favourite' features. That would be the quick fix.
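The suggested workaround might look like the following sketch (it assumes the mixOmics package and its liver.toxicity data from the example above; selectVar() is the package's accessor for selected variables, and the 'favourite' features chosen here are placeholders):

```r
library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene[, 1:200]
Y <- liver.toxicity$clinic

# 1. fit the sparse model and retrieve the selected X variables per component
spls.obj <- spls(X, Y, ncomp = 2, keepX = c(6, 6), keepY = c(6, 6))
selected <- unique(unlist(lapply(1:2, function(k)
  selectVar(spls.obj, comp = k)$X$name)))

# 2. refit a classic (non-sparse) PLS on the selected variables plus the
#    user's 'favourite' features (placeholder choice here)
favourites <- colnames(X)[6:10]
pls.obj <- pls(X[, union(selected, favourites)], Y, ncomp = 2)
```

This keeps the sparse step mathematically untouched while still guaranteeing the user's features appear in the final (non-sparse) model.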

Regards, Kim-Anh

A/Prof. Kim-Anh Lê Cao NHMRC Career Development Fellow School of Mathematics and Statistics, Melbourne Integrative Genomics Bld 184 | The University of Melbourne T: +61 (0)3834 43971 | http://mixomics.org/ | http://lecao-lab.science.unimelb.edu.au/ | @mixOmics_team



Max-Bladen commented 2 years ago

Thanks for the clarification @mixOmicsTeam. The results certainly didn't feel right.

I'll inform users of this and put aside this PR. I'll do a bit of research and thinking as to whether there is a way this could be implemented, but I'll move on for now.