joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/

Simulation A in "Waste Not, Want Not" #299

Closed: michberr closed this 10 years ago

michberr commented 10 years ago

Hi Joey,

With my dataset, I'm most interested in whether my samples cluster together according to different factors (simulation A in your paper), rather than which specific OTUs are differentially abundant between groups (simulation B).

Based on Figure 3, it looks like using proportions of reads has higher accuracy for Bray-Curtis and weighted UniFrac than the other normalization methods. In your paper you explain very clearly why using proportions is essentially throwing out data, but the figure suggests that, especially for low effect sizes, proportions may be more accurate. Can you clarify these results?

Also, I've played around with DESeq2 using your phyloseq_to_deseq2 function, but I'm confused about how you would use the variance stabilization implemented in that package to make, for instance, an NMDS ordination. Can you provide an example?

Thanks Michelle

joey711 commented 10 years ago

Michelle,

Yes, as described in the article, the individual proportions in a multinomial (think: vector of relative abundance of OTUs) do not include information about heteroscedasticity, which would require also utilizing the library size along with some model of the uncertainty. This lost information about the uncertainty matters a great deal when comparing an OTU's proportions between samples (or sample classes), and failing to appropriately account for this uncertainty is a large reason that tests for differential abundance based on simple relative abundance resulted in embarrassingly high false positive rates in our simulations.
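As a quick illustration of the point about library size (a base-R sketch with made-up numbers, not from the paper): under a simple binomial model, the standard error of an observed proportion shrinks with sequencing depth, so the same observed proportion is far less certain in a shallow sample than in a deep one.

```r
# Same observed proportion for an OTU in two samples,
# but very different library sizes (illustrative numbers)
p <- 0.01
n_small <- 100      # shallow sample: 1 read of the OTU
n_large <- 100000   # deep sample: 1000 reads of the OTU
se_small <- sqrt(p * (1 - p) / n_small)
se_large <- sqrt(p * (1 - p) / n_large)
se_small / se_large  # the shallow sample's estimate is ~32x more uncertain
```

A bare table of proportions carries neither n_small nor n_large, which is exactly the lost uncertainty information described above.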

The application in which we are clustering microbiome samples is a bit different. Since we are going to talk about this on Tuesday, and you'll have the opportunity to ask about this on Thursday, I will hold off on answering just yet. Thanks for this insightful question. I do think you should repeat it during the discussion on Tuesday (or Thursday).

As for a specific example of variance stabilization, the code is included in the supplemental, which we are about to release with the final version of the article in PLoS Computational Biology. I will post the relevant info about that here. I may also mock up a small example of some variance stabilization code and post it here.

Cheers

joey

joey711 commented 10 years ago

Michelle

Question 1, appropriateness of proportions in whole-microbiome comparisons

We touched on this a little in the lecture on Thursday, but not very directly. While rarefying is "throwing out data" by definition from any point of view, when comparing one microbiome sample to another you are comparing the entire vector of proportions at once. This uses (nearly) all the data, especially in the Clustering-Simulation context of various methods for whole-microbiome normalization, distance, and clustering, none of which were going to use any estimates of uncertainty. I suppose one could argue for a clustering approach that weights samples differently depending on their library size, so that better-sequenced samples contribute more to the clustering results than poorly-sequenced samples. This might be one way to use the library-size information that was otherwise ignored. I'm not sure how this would compare with the methods shown in the figure, or what the best implementation would be, but it is helpful to think about.
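To make the "entire vector at once" point concrete, here is a toy base-R sketch (illustrative numbers only, not from the paper): after converting to proportions, a Bray-Curtis comparison uses every OTU simultaneously, and the two libraries' very different depths drop out of the calculation.

```r
# Two samples with different library sizes but similar composition
x <- c(600, 300, 100)   # 1000 total reads
y <- c(70, 20, 10)      # 100 total reads
px <- x / sum(x)        # proportions: 0.6, 0.3, 0.1
py <- y / sum(y)        # proportions: 0.7, 0.2, 0.1
# Bray-Curtis dissimilarity computed over the whole proportion vectors
bray <- sum(abs(px - py)) / sum(px + py)
bray  # 0.1
```

Within phyloseq itself, the analogous steps would be transform_sample_counts() to convert to proportions followed by distance(physeq, "bray").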

Question 2, variance stabilization prior to NMDS

I can't yet "advertise" the supplemental until PLoS finally releases the article and its supporting materials. Almost there.

I will mock up that example and put it here in a follow-up.

joey

joey711 commented 10 years ago

Example of variance stabilized counts in NMDS

Reproducible example, starting from data

A reproducible example, using just ordination on an arbitrary subset of the GlobalPatterns dataset.

library("phyloseq")
packageVersion("phyloseq")
## [1] '1.7.20'
library("DESeq2")
packageVersion("DESeq2")
## [1] '1.2.10'
library("ggplot2")
packageVersion("ggplot2")
## [1] '0.9.3.1'
data("GlobalPatterns")
# For the sake of time and example, keep only an arbitrary subsample of
# OTUs. Take them from the most abundant ones...
NTAXA = 1000
keepOTUs = names(sort(taxa_sums(GlobalPatterns), decreasing = TRUE)[1:NTAXA])
GP = prune_taxa(keepOTUs, GlobalPatterns)
# Use `~1` as the experimental design so that the actual design doesn't
# influence your transformation.
GPdds = phyloseq_to_deseq2(GP, ~1)
GPdds = estimateSizeFactors(GPdds)
GPdds = estimateDispersions(GPdds, fitType = "local")
# plotDispEsts(GPdds)

Now perform the variance-stabilizing transformation and replace the original OTU abundances in a copy of GP with them. We'll call the copy GPvst.

GPvst = GP
otu_table(GPvst) <- otu_table(getVarianceStabilizedData(GPdds), taxa_are_rows = TRUE)
pvst = plot_ordination(GPvst, ordinate(GPvst, "NMDS", "bray"), color = "SampleType")
## Run 0 stress 0.175 
## Run 1 stress 0.175 
## ... New best solution
## ... procrustes: rmse 0.0004844  max resid 0.001544 
## *** Solution reached
pvst + geom_point(size = 7) + ggtitle(paste0("NMDS/Bray-Curtis for VST of top ", 
    NTAXA, " OTUs. GlobalPatterns"))

[Figure: NMDS/Bray-Curtis ordination of the VST-transformed data, colored by SampleType]

Try the regularized log transformation

GPrl = GP
rld = rlogData(GPdds)
rownames(rld) <- taxa_names(GP)
otu_table(GPrl) <- otu_table(rld, taxa_are_rows = TRUE)
prlt = plot_ordination(GPrl, ordinate(GPrl, "NMDS", "bray"), color = "SampleType")
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1001 
## Run 1 stress 0.1001 
## ... procrustes: rmse 0.0001984  max resid 0.0006378 
## *** Solution reached
prlt + geom_point(size = 7) + ggtitle(paste0("NMDS/Bray-Curtis for RLT of top ", 
    NTAXA, " OTUs. GlobalPatterns"))

[Figure: NMDS/Bray-Curtis ordination of the rlog-transformed data, colored by SampleType]

joey711 commented 10 years ago

I think that wraps up this issue. I will close for now. Please post back if you have any follow-up.

Cheers

joey

michberr commented 10 years ago

Thanks Joey! This is excellent.

mikerobeson commented 10 years ago

Hi Joey,

I have tried to use UniFrac in the "regularized log transformation" example above. That is, I replaced "bray" with "unifrac" in the line below. So this:

prlt = plot_ordination(GPrl, ordinate(GPrl, "NMDS", "bray"), color = "SampleType")

became:

prlt = plot_ordination(GPrl, ordinate(GPrl, "NMDS", "unifrac"), color = "SampleType")

The "unifrac" option failed. Upon further investigation, UniFrac(GPrl) returns a distance matrix of nothing but 0s. This is not the case if I use weighted UniFrac: UniFrac(GPrl, weighted=TRUE).

However, the GPvst example seems to work fine with either unweighted or weighted UniFrac. I assume something wonky is going on behind the scenes in how the GPrl data is built up and then read by UniFrac?

-Cheers! -Mike

joey711 commented 10 years ago

Sorry, Mike, I can't tell what the problem is from your description alone. Good to know you were able to reproduce my example.

I'm going to venture a guess that the problem with UniFrac has something to do with negative numbers in the transformed table. I would check for those, and consider eliminating them in some way. Say, set them to zero. UniFrac (1) isn't prescribed to handle negative values at all, and (2) requires zero-value counts somewhere to return a non-zero distance. The latter is one of the reasons I don't find (unweighted) UniFrac to be a useful distance under modern (non-obvious) experimental conditions.
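For instance, a minimal sketch of that check-and-replace (a toy matrix standing in for a variance-stabilized table; the values are made up):

```r
# Toy stand-in for a variance-stabilized OTU table, which can contain negatives
vst_like <- matrix(c(-0.5, 2.1, 0, 3.3, -1.2, 4), nrow = 2)
any(vst_like < 0)            # check whether any negative values are present
vst_like[vst_like < 0] <- 0  # clamp negatives to zero
```

The same two lines should work on the matrix returned by getVarianceStabilizedData(GPdds) before it is wrapped back into an otu_table, as in the example above.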

Hope that helps. Please post back when you've narrowed down or solved the problem. A few issues regarding negative values in log-like transformed data were also posted by users about a month ago.

mikerobeson commented 10 years ago

That is what I suspected at first. Assuming I did not mess up along the way, the transformed otu_table of GPrl consists of only positive numbers, while the GPvst table has many negative values. It seems odd that GPrl would return a UniFrac matrix of all 0s and not GPvst.

I'll see if I can create the distance matrices externally and try again. I'll keep you posted if I find out anything else.

mikerobeson commented 10 years ago

Hi Joey,

Well, I've looked into this a bit more. Remembering that UniFrac is basically a presence/absence metric (like Sørensen-Dice, etc.)... Since all of the values in the OTU table of GPrl are positive, it is effectively saying that every OTU is present in every sample, so every pair of samples shares all of the branch length (obviously, because abundance is ignored; it is a presence/absence matrix). This leads to the distances between all samples being 0.
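A tiny base-R sketch of this point (toy numbers): with no zeros anywhere in the table, a presence/absence distance between any two samples collapses to 0, regardless of how different the abundances are.

```r
# Two samples, three OTUs, no zero counts anywhere
m <- matrix(c(5, 1, 3,
              2, 4, 6), nrow = 2, byrow = TRUE)
pa <- m > 0                          # presence/absence view of the table
shared <- sum(pa[1, ] & pa[2, ])     # OTUs present in both samples
either <- sum(pa[1, ] | pa[2, ])     # OTUs present in at least one sample
d <- 1 - shared / either
d  # 0: with no zeros, the samples look identical to a presence/absence metric
```

UniFrac additionally weights by shared branch length on the tree, but the all-positive table defeats it for the same reason.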

As this was not the case for the GPvst data, I had assumed that this implementation of UniFrac was interpreting any negative value as a 0 for the GPvst OTU table. But this is not the case (I manually converted all negative values to 0s and got a slightly different but similar NMDS plot compared to leaving the negative values in place for the GPvst data).

So, I am simply replacing the negative values with 0s on the "Variance stabilized counts" output. As per discussion here: http://permalink.gmane.org/gmane.science.biology.informatics.conductor/54777

I'll most likely stick to weighted UniFrac for my analysis anyway. I am just tinkering at this point because I am curious. Also, I should have considered that these transformations (e.g.: "regularized log transformation" and "Variance stabilized counts") are meant more for abundance-based analyses anyway. :-)

joey711 commented 10 years ago

In Waste Not Want Not we simulate samples from either one of a pair of microbiomes that are essentially non-overlapping (Feces, Ocean). If we don't perturb these model microbiomes prior to simulation, (unweighted) UniFrac does very well... but so does any other distance you can imagine. As we perturb the model microbiomes so that they are more similar prior to simulation -- reducing the Effect Size to something more subtle -- unweighted UniFrac does precipitously worse. In fact, its objective performance is generally terrible under all but the very large effect sizes, conditions under which alternative distances had long since achieved perfect accuracy.

The very bad performance of u-UniFrac was surprising to me at first, but not for a good reason. I was not surprised because of some intrinsic appreciation for a fundamental principle of microbiomes that u-UniFrac addresses (and I don't believe that it does); rather, it was surprising because u-UniFrac has been so popular in the microbiome literature and used for so long, often without mentioning or reporting alternative distances. Under experimental conditions in which many/most OTUs can be present in all sample classes, it follows that u-UniFrac is inappropriate, and we expect its performance as a distance to be quite bad. This is exactly for the reasons you've just described, namely that u-UniFrac requires zeros in the abundance table in order to return non-zero distances. It may be possible under special circumstances to hack a solution for u-UniFrac by setting an arbitrary threshold for 'unobserved'/zero values, and rarefying the data is one such hack. And it is a hack... AKA an unjustified, inadmissible, ad hoc transformation. Meanwhile, other distances, including weighted UniFrac, can perform acceptably well, and are not purely dependent on the presence of zeros.

Meanwhile, negative values in your abundance table are not acceptable for any flavor of UniFrac, and are similarly inappropriate for distance measures that expect counts or positive values. I don't want to prohibit negative values in OTU tables (though it would be easy to do) because they might be useful for some methods. An alternative solution that I like would be to catalog the supported distances that cannot take negative values, and throw a warning that negative values are not supported while replacing all negative values with zero. This would inform the user about the negative values without forcing them to write extra code to handle the negatives just because they did a log-like transformation.
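A hypothetical helper along those lines (clamp_negatives is not a phyloseq function, just a sketch of the proposed warn-and-replace behavior on a plain matrix):

```r
# Warn about, then zero out, negative values before a counts-only distance
clamp_negatives <- function(mat) {
  if (any(mat < 0)) {
    warning("Negative values are not supported by this distance; replacing them with zero.")
    mat[mat < 0] <- 0
  }
  mat
}

m <- clamp_negatives(matrix(c(-1, 2, 3, -4), nrow = 2))
m  # negatives replaced with 0, positives untouched; a warning was emitted
```

A phyloseq-internal version could apply the same logic to the otu_table matrix inside UniFrac and the other distances cataloged as negative-intolerant.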

Thoughts?

Cheers

joey

mikerobeson commented 10 years ago

Thanks for your response Joey!

I agree that some warning should appear if UniFrac (or an equivalent distance metric with subtle caveats) is used on the transformed data. Even experienced users of UniFrac, like myself, often forget that the values in the OTU table should be zero or positive. I had edited/updated my earlier post just before your reply to note that I opted to replace all negative values with 0s in the variance-stabilized output.

But for the regularized log transformation, I think it may be more difficult, because the transformed OTU table contains all positive values, at least for the data I am working with, and even in the example tutorial here, if I am not mistaken. I'd think some way of dynamically calculating a threshold would be in order here? Maybe use scale() (just as a quick-and-dirty example of a function that will convert some values to negative numbers) on the rld variable from your example, and then use that directly in the analysis or as an otu_table mask? Then set to zero any value that scales to a negative number? Or something akin to this?
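A quick-and-dirty base-R sketch of that scale()-based masking idea (a random toy matrix stands in for the all-positive rlog table; the below-row-mean threshold is entirely ad hoc, as discussed above):

```r
set.seed(1)
rld_like <- matrix(rexp(12, rate = 0.5), nrow = 3)  # all-positive, like an rlog table
z <- t(scale(t(rld_like)))  # center and scale each OTU (each row) separately
masked <- rld_like
masked[z < 0] <- 0          # treat below-average values for an OTU as 'absent'
```

This manufactures the zeros that a presence/absence metric like unweighted UniFrac needs, but whether the resulting distances mean anything is exactly the open question here.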

I'll think about this more over the next few days as I analyze some of my data using phyloseq.

-Mike