rgcca-factory / RGCCA

https://rgcca-factory.github.io/RGCCA/
10 stars 11 forks source link

When I once used RGCCA, I found that the AVE fraction value of a certain block was NA. What is the reason for this? #63

Closed WUZHExl closed 1 year ago

GFabien commented 1 year ago

Hi @WUZHExl, could you provide a reproducible example and give the version you are using? It may be a bug, and we would happily investigate the question.

Hi @llrs! We are still making a few modifications to the main branch of the repository, but we hope to release this branch on CRAN soon!

WUZHExl commented 1 year ago

No problem, I'm using version 2.1.2,

Here is an example I used where the AVE fraction of module B is NA

I have data here https://pan.baidu.com/s/1kGWnrEgH_tZet_xVymih1A the password is 1234

############################################################ ` A<-readRDS('A.rds') B<-readRDS('B.rds')

data<-list(t(A),t(B))

res<-doRGCCA(data,3,ncomp = c(4,4))

doRGCCA <- function (data, K, C=1-diag(length(data)), ncomp=rep(1, length(data)), scheme="centroid", tau = rep(1,length(data))){

cat('run rgcca\n') result.rgcca = data %>% rgcca(C, tau, ncomp = ncomp, scheme = scheme, verbose = FALSE)

cat('cbind data\n') resDat <- do.call(cbind, result.rgcca$Y)

cat('run clust\n') clust.rgcca <- resDat %>% dist %>% hclust(method="ward.D2") %>% cutree(k=K)

res <- list(clust=clust.rgcca, fit = result.rgcca, resDat=resDat)

return(res)

} `

GFabien commented 1 year ago

Thank you, I need help getting your data. Could you try using the development version of RGCCA available on the main branch of this GitHub? To install it, you can do the following:

remove.packages("RGCCA")
devtools::install_github(repo="https://github.com/rgcca-factory/RGCCA.git", ref = "main")

Feel free to ask if you need help running your code with the new version of RGCCA.

llrs commented 1 year ago

Hi @GFabien, thanks for the heads up. I hope that the CRAN submission will go smoothly. In this release you will introduce a lot of breaking changes, don't forget to take into account the reverse dependencies: multiblock, RegularizedSCA and RVAideMemoire. It might slow down or prevent an initial successful submission.

I might look to the package to start making my package compatible with the future release and submit it to CRAN once the new version is accepted. But I think one of the improvements made mine redundant.

WUZHExl commented 1 year ago

Hello,I'll put the data in the attachment

------------------ 原始邮件 ------------------ 发件人: "rgcca-factory/RGCCA" @.>; 发送时间: 2023年3月3日(星期五) 晚上10:54 @.>; @.**@.>; 主题: Re: [rgcca-factory/RGCCA] When I once used RGCCA, I found that the AVE fraction value of a certain block was NA. What is the reason for this? (Issue #63)

Thank you, I need help getting your data. Could you try using the development version of RGCCA available on the main branch of this GitHub? To install it, you can do the following: remove.packages("RGCCA") devtools::install_github(repo="https://github.com/rgcca-factory/RGCCA.git", ref = "main")
Feel free to ask if you need help running your code with the new version of RGCCA.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

GFabien commented 1 year ago

I tried running your code on your data with the most recent version of the package (from branch main of this GitHub). This instruction was not working:

result.rgcca = data %>% rgcca(C, tau, ncomp = ncomp,
    scheme = scheme,
    verbose = FALSE)

I replaced it with:

result.rgcca = data %>% rgcca(connection = C, tau = tau, ncomp = ncomp,
    scheme = scheme,
    verbose = FALSE)

It worked fine, and I didn't see any NA in the AVE. Please tell me if you still face problems after updating the package.

WUZHExl commented 1 year ago

Thank you for your reply. It helped me a lot.

I solved this problem by using the latest version of RGCCA as you said.

But there was another question.

When I used RGCCA to merge the previous data, there was no NA in rgcca$AVE, but the number of rows in a block in rgcca$a was reduced.

GFabien commented 1 year ago

Glad to hear it helped you!

It is expected, but it may be worth documenting this better. What happens is that 81 columns in block B are constant across all samples. Therefore, they don't provide any information, which is why they are removed from the analysis and not present in rgcca$a.

llrs commented 1 year ago

Imho, it would be better to detect those constants before any calculation takes place and then first that before and then report/stop than returning less output than input.

GFabien commented 1 year ago

I agree with you, @llrs. I also thought it could be disturbing. Would a warning explaining that some variables have been removed because they were constant be enough?

WUZHExl commented 1 year ago

I am using RGCCA to integrate multi-modal data and use the integration results for clustering, in which rows are genes and columns are samples.

I hope to get a list of genes by calculating contribution, so as to conduct relevant downstream analysis.

Therefore, I want to know:

1.The contribution of a variable (gene) to clustering on a block; 2.The contribution of a block to clustering; 3.And the contribution of this variable to clustering as a whole;

Here's how I calculate it:

Calculate the contribution of variables in a block to clustering through rgcca$a: A_weights <- rgcca_output$a$A a_contribution_per_dimension <- abs(A_weights["variable a", ])/colSums(abs(A_weights)) a_contribution_across_dimensions <- sum(a_contribution_per_dimension)

The contribution of a block to clustering is calculated by rgcca$AVE.

I wonder if this calculation is reasonable;

If so, then as mentioned earlier, can a missing row value in a block of rgcca$a be assigned to 0?

GFabien commented 1 year ago

If so, then as mentioned earlier, can a missing row value in a block of rgcca$a be assigned to 0?

In the meantime, you can do it with the following code snippet:

a <- lapply(seq_along(data), function(j) {
  x <- matrix(0, nrow = ncol(data[[j]]), ncol = result.rgcca$call$ncomp[j])
  rownames(x) <- colnames(result.rgcca$call$blocks[[j]])
  x[rownames(result.rgcca$a[[j]]), ] <- result.rgcca$a[[j]]
  return(x)
})

When talking about rgcca$AVE, I guess you talk about rgcca$AVE$AVE_X. This block AVE (for Average Variance Explained) reflects the proportion of the block variance captured by each component. It makes sense here to sum them up for a given block. However, I am unsure how to relate this quantity to the block contribution to clustering. I guess if you clustered your genes based on the results of RGCCA, you could look at these block AVEs and compare them.

I think your computation for the variable contribution seems reasonable. Many criteria could work.

llrs commented 1 year ago

Would a warning explaining that some variables have been removed because they were constant be enough?

It is reasonable as you already dealt with the problem downstream but in my opinion not enough because users should consider before doing the analysis why their variables are constant.

GFabien commented 1 year ago

@WUZHExl, I'm closing this issue. Feel free to open a new issue if you have other questions.