tpq / propr

[OFFICIAL] [CoDA] An R package to calculate proportionality and other measures for compositional data
67 stars 9 forks source link

Question about scaling propr to very large data sets (Part 2) #24

Open sapuizait opened 3 years ago

sapuizait commented 3 years ago

Hi all

First of congrats on your wonderful software, just read about it and it looks very promising. I have a question/request.

I have two datasets (coming from the same samples (same fastq files) but have been generated using different databases) and I want to compare the elements (features) of dataset A against the elements of dataset B only. This is simply because otherwise this project is not feasible. Dataset A has 200k features and dataset B 40k features. And thus I want to check for correlations of the A[1] ~ B[1] A[1] ~ B[2] A[1] ~ B[3] .... A[2] ~ B[1] A[2] ~ B[2] .... etc etc

Now normally (using the good ol spearman -apologies for speaking the name who shall not be spoken out loud- ) I would use sth like this: rho <- matrix(NA, nrow = ncol(x), ncol=ncol(y)) for (i in 1:ncol(x)) {for (j in 1:ncol(y)) { rho[i, j] <- cor.test(x[,i], y[,j], method = "spearman")$estimate}}

(x and y are the two datasets)

but when I tried propr(x[1]~ y[2], ivar = NA) or

propr(x[1], y[2], ivar = NA)

does not seem to work... so my question is: is this possible at all? ps: my data are already clr transformed, this is why I used 'ivar=NA'

thanks in advance

tpq commented 3 years ago

Thanks for your interest in propr! It sounds like the data can be thought of as "multi-omics data" in that you have two views of the same subjects. I'll assume A and B are both compositional. Could you please have a look at https://github.com/tpq/propr/issues/9 and let me know if you have any further questions? Best, tpq

sapuizait commented 3 years ago

Hey tpq Thanks for your reply, yes data are compositional and 0s have been imputed using the zCompositions package. I actually did look at the multiomics thread and i tried the example you suggested with the 'data(iris)' However, as far as I can see the solution works well for merging the two datasets and comparing all features, meaning; if dataset A has 3 features and B has 2 features, then the comparisons will be A1 vs A2, A1 vs A3, A1 vs B1, A1 vs B2, A2 vs A3, A2 vs B1, A2 vs B2, A3 vs B1, A3 vs B2 which is all possible comparisons. What I am looking for is a solution that will allow me to specify that I only want to compare features of dataset A with features of dataset B but not the features of each dataset with each other. Therefore the desired output would only look the comparisons between: A1 vs B1 and A1 vs B2, A2 vs B1 and A2 vs B2 and finally A3 vs B1 and A3 vs B2

Is that even possible or should I just try the #9 solution and then pick the pairs that I care about?

tpq commented 3 years ago

Ah, I see. Unfortunately, I have not made any functions for this case. The #9 solution would be the way to go. You may find the functions getMatrix() and getRatios() helpful for parsing the results. Though, I'm afraid you might run into performance issues with 240k features (unless you have a very big computer!).

You might want to try something like...

devtools::install_github("tpq/propr")
library(propr)
clr <- function(x) sweep(log(x), 1, rowMeans(log(x)), "-")
A <- clr(matrix(rpois(100*3, 100), 100, 3))
B <- clr(matrix(rpois(100*30, 100), 100, 30))
for(col_j in 1:ncol(A)){
  REL_j <- cbind(A[,col_j], B)
  pr <- propr(REL_j, ivar = NA)
  print(getMatrix(pr))
}

Each j-th element in the for loop would give you proportionality between A[,j] and B. It'll also break down the 240k^2 results into 200 separate 40k^2 results, which is easier to parallelize.

sapuizait commented 3 years ago

Great! Thanks a lot! I will give it a try and see how it goes. I m running everything on a cluster so hopefully it should be feasible!

sapuizait commented 3 years ago

Hey, works nicely, thanks a lot! One quick question and then I ll close the topic: I cannot use the fdr. Not only because of the large dataset but also because I am primarily interested in the negative 'rho' values (and as far as I understand fdr is only for positive values). I used to calculate pvalues using t-approximation from the correlation matrix, but I havent tried to apply sth like that here - What is your opinion on this? I am of course inclined towards dropping pvalues and fdr completely and presenting the highest correlations. Again, thanks for your help

tpq commented 3 years ago

Our reason for calculating FDR on only positive values is that the negative values can be difficult to interpret (see #4). I don't feel comfortable endorsing exact p-value calculation from rho using t-approximation, simply because I don't know enough to know whether the approach is valid. Permutation is computationally expensive, but also easy to implement. Have a look at propr:::updateCutoffs.propr source code for an example.

  for(cut in 1:nrow(FDR)){ # for each user-specified cutoff...

    # Count positives as rho > cutoff, cor > cutoff, phi < cutoff, phs < cutoff
    if(object@metric == "rho" | object@metric == "cor"){
      FDR[cut, "truecounts"] <- sum(object@results$propr > FDR[cut, "cutoff"])
    }else{ # phi & phs
      FDR[cut, "truecounts"] <- sum(object@results$propr < FDR[cut, "cutoff"])
    }

    FDR[cut, "FDR"] <- FDR[cut, "randcounts"] / FDR[cut, "truecounts"]
  }

You could, for example, compute a small null distribution of negative rho for each A[,j]. This would break up the FDR computations into parallelizable chunks.

sapuizait commented 3 years ago

I see. That is a bit unfortunately as I am primarily interested in negative correlations. Is there any other metric or any other way I could try to look for negative correlations besides the negative r values? or is there any way to exclude false positives?

tpq commented 3 years ago

I think the key issue is that the negative rho (like negative correlations) would be highly sensitive to the choice of the reference. They may mean something if you believe the CLR is a suitable normalization method. Without normalization, I do not know of any ways to obtain true negative pairwise associations. In the case of multi-omics data analysis, however, you might be satisfied with knowing what associations exist with respect to the CLR centers. In this case, negative rho (or indeed negative Pearson) may hold some meaning to you. You might find our discussion of this topic helpful https://www.nature.com/articles/s41592-020-01006-1

tpq commented 3 years ago

s41592-020-01006-1.pdf

sapuizait commented 3 years ago

OK, thanks! i ll look at it carefully tomorrow and see if i can figure sth out!

sapuizait commented 3 years ago

Dear Thomas I have looked in the paper you sent me and things are a bit clearer now. Thanks a lot for helping with this. I realized a mistake that I had done was to use the same normalization reference for both datasets but i went back and fixed that. That way the datasets are completely independent and comparing features from A to B should not cause any problems. As far as I can see your scientific reports paper describes that weird false-positive-negative-correlations event when comparing features that come from the same dataset and have been corrected against the same geometric mean. i also completely understand why for A and B the clr has to be applied independently. Again, thanks!