tpq / propr

[OFFICIAL] [CoDA] An R package to calculate proportionality and other measures for compositional data
64 stars 8 forks source link

Question about how to interpret rho and FDR #17

Open taylorreiter opened 4 years ago

taylorreiter commented 4 years ago

Hi @tpq, thank you for the wonderful CoDa methods you have implemented in propr! I read Microbiome Datasets Are Compositional: And This Is Not Optional at the beginning of my PhD and am excited to be using the methods now.

I've read your the following posts/papers, but I still have some questions about how to interpret and use propr.

Is there an analogy to draw between propr and e.g. R^2 from spearman correlation, where abs(rho@result$propr) indicates higher proportionality? I know the math is completely different, but I have no intuition for the meaning of propr or lrv. (I've seen you state that you're skeptical of negative rho, so I'm assuming these should just be ignored.)

    Partner Pair        lrv metric alpha        propr
1         2    1  1.3139167    rho  <NA>  0.809043652
2         3    1  2.4083228    rho  <NA>  0.711525696
3         3    2  3.0522840    rho  <NA>  0.637794103
4         4    1  3.8199330    rho  <NA>  0.600589599
5         4    2  4.2658219    rho  <NA>  0.557596470
6         4    3  2.2894219    rho  <NA>  0.793934072
7         5    1 17.3747962    rho  <NA>  0.050633754
8         5    2 15.2983714    rho  <NA>  0.167658262
9         5    3 22.5732504    rho  <NA> -0.137324292
10        5    4 26.1975924    rho  <NA> -0.243765501

I also don't understand how to interpret FDR from updateCutoffs(rho). Currently I'm arbitrarily using rho95 <- rho[">", .95] because I'm not sure how to interpret this table. I think this is telling me I could use 0.65 given that the FDR falls below .05 at this value...but I wanted to check.

  cutoff randcounts truecounts          FDR
1   0.05   11936.30      10353 1.1529315174
2   0.35     485.26       2443 0.1986328285
3   0.65       0.03        299 0.0001003344
4   0.95       0.00         19 0.0000000000

Lastly, I've seen your implementation for multi-omics data sets here and have it working. I have 16s, ITS, and metatranscriptomic data from the same N samples. I understand how to calculate proportionality between these datasets using your implementation, but I'm curious if I can calculate proportionality between environmental variables as well...we also took measurements like pH, nitrogen, latitude, longitude, etc., and I would like to know if any ASVs/transcripts are proportional to these variables.

tpq commented 4 years ago

Hey Taylor!

Thanks for your interest in propr, and for asking such detailed questions!

I'll try to answer them quickly now, but I also invite follow-up questions.

(1)

Is there an analogy to draw between propr and e.g. R^2 from spearman correlation, where abs(rho@result$propr) indicates higher proportionality?

There is some analogy to draw between the two, and indeed higher values indicate stronger associations. The strength of the associations are defined with respect to the CLR transformation. When rho=1 for a pair i,j -- a plot of OTU i vs. OTU j will give a perfect line where the slope is 1 for OTUs expressed in CLR coordinates. (It can be helpful to think of the CLR as a kind of normalization that scales abundances according to the average for a sample. From this perspective, propr is sort of like a correlation between normalized variables.).

(2)

I also don't understand how to interpret FDR from updateCutoffs(rho). Currently I'm arbitrarily using rho95 <- rho[">", .95] because I'm not sure how to interpret this table.

I would read the table something like this -- if the cutoff were 0.65, the FDR, empirically speaking, would be 0.0001. So you can select it very safely! On the other hand, if the cutoff were 0.35, the FDR would be 20%. If you wanted an FDR of, say 5%, the 0.65 cutoff is too conservative. In this case, I would re-run

updateCutoffs(rho, cutoff = seq(0.35, .65, 0.05))

This will try additional cutoffs from 0.35 to 0.65. One of them may have your target FDR.

(3)

Lastly, I've seen your implementation for multi-omics data sets here and have it working. I have 16s, ITS, and metatranscriptomic data from the same N samples.

Great! I am glad it is useful! FYI we have written a pre-print about the methodological justification for "double CLR" approach here -- https://www.biorxiv.org/content/10.1101/847475v1

``but I'm curious if I can calculate proportionality between environmental variables as well...we also took measurements like pH, nitrogen, latitude, longitude, etc., and I would like to know if any ASVs/transcripts are proportional to these variables.''

Proportionality is useful for finding associations between components belonging to a composition. Variables like nitrogen, latitude, longitude, etc. are not components, but rather real measurements. In this case, I would CLR transform my OTUs manually, and fit a linear model per usual. The pseudo-code might look something like this:

clr <- function(x) log(x) - mean(log(x))
OTUs_as_CLR <- clr(OTUs+1)
lm(pH ~ OTUs_as_CLR)

Or you could popular multivariable ecology methods like RDA from vegan package:

library(vegan)
rda(OTUs_as_CLR, environmental_variables)
taylorreiter commented 4 years ago

Thank you! these explanations were extremely helpful.