DESeq variance stabilization and community distance matrices

werdnus commented 9 years ago

I've been going through the phyloseq tutorials, trying to implement the negative binomial variance stabilization by way of DESeq, (as in the 'Waste not, want not' paper) and I'm running into a bit of a snag with moving on to downstream analysis. I was wondering if I could get some advise.

I think the best example of the sort of thing I'm trying to do is sort of like what's presented in this thread.

I can get my phyloseq object with the variance stabilized counts in the OTU table, and see a pretty similar pattern in the change to the heat map as in the example. Where I'm confused is how to take that into downstream analysis: I'd like to generate community distance matrices with these variance stabilized counts, to examine changes in community similarity with environmental and physical distance between sampling locations, but the variance stabilizations include lots of negative numbers, which makes standard distance calculations (e.g., Bray-Curtis) impossible, because they do not work with negative numbers.

Where @joey711 uses distance calculations in his examples (e.g., the ordination tutorial), he doesn't use the DESeq variance stabilized counts, but instead a proportional transformation (e.g., 1e+06 * x/sum(x)).

I've found one thread talking about issues with the geometric means calculation where the poster (@btemperton) converts all of the negatives to zeros for distance calculation and ordination. The whys and hows of this are not really discussed in that thread, since it was off the topic of geometric means, but this seems a bit like throwing the baby out with the bath: you lose all of the information in the differential abundances of low abundant taxa (as well as the effect of presence/absence), and as such seems just as problematic as rarefying, particularly given the proportion of taxa in the low-abundance category in most microbiome datasets. I tried it, and (as I suspected) see a homogenizing effect in the ordination as compared to proportionally transformed counts.

I feel like the answer has got to be something relatively simple, but I've been googling up a storm, re-reading the distance metric sections in all my stats text books, and generally just banging my head against a wall for a couple of weeks now, so I thought it was time to ask for help.

Here's some code that reproduces the issue:

# load the libraries
library(phyloseq)
library("DESeq2")

# random OTU table
trial <- matrix(,10,100) 
for(i in 1:10){`
        trial[i,] <- sample(0:3,100, replace = T)
        }
trial.otu <- otu_table(trial, taxa_are_rows=F)

# some sample data 
trial.sam.data <- factor(c("a","a","a","a","a","b","b","b","b","b"))
trial.data <- sample_data(as.data.frame(trial.sam.data))

# biuld the phyloseq object
trial.phyloseq <- phyloseq(otu_table(trial.otu), sample_data(trial.data))

# test the ordination
trial.ord <- ordinate(trial.phyloseq, "NMDS", "bray")
plot_ordination(trial.phyloseq, trial.ord, type = "samples", color = "trial.sam.data")
# side note: I don't understand why this won't return colours for the two sample classes specified above.

# DESeq conversion 
trial.DEseq = phyloseq_to_deseq2(trial.phyloseq, ~trial.sam.data)

# DESeq2 Variance Stabilization
# You must step through the size factor and dispersion estimates prior to calling the getVarianceStabilizedData() function.
trial.DEseq = estimateSizeFactors(trial.DEseq)
trial.DEseq = estimateDispersions(trial.DEseq)
trial.vst = getVarianceStabilizedData(trial.DEseq)
dim(trial.vst)

# Save the untransformed data as a separate variable so you can go back to it
trial.phyloseq0 = trial.phyloseq

# add the varience stabilized otu numbers into the dataset:
otu_table(trial.phyloseq0) <- otu_table(trial.vst, taxa_are_rows = TRUE)

# Now, we re-do the ordination
trial.ord <- ordinate(trial.phyloseq0, "NMDS", "bray", autotransform=T)
plot_ordination(trial.phyloseq, trial.ord, type = "samples", color = "trial.sam.data")

Which gives me these errors:

Warning messages: 1: In metaMDS(veganifyOTU(physeq), distance, ...) : 'comm' has negative data: 'autotransform', 'noshare' and 'wascores' set to FALSE 2: In distfun(comm, method = distance, ...) : results may be meaningless because data have negative entries in method “bray” 3: In postMDS(out$points, dis, plot = max(0, plot - 1), ...) : skipping half-change scaling: too few points below threshold

Any help would be appreciated.

MTutino commented 8 years ago

Hi! I am encountering the same problem Did you find a solution to it?

Thanks, Mauro

joey711 commented 8 years ago

Honestly, you'll probably just want to use "simple proportions" for the distance metrics being discussed. Not all transformations are appropriate for all distances.

This is explained in more detail on the new phyloseq-FAQ answer for exactly the problem of negative numbers in variance stabilized output.

Let me know if I should add a little more detail there.

MTutino commented 8 years ago

Thank you! I have read papers using DESeq2->Bray->NMDS and I thought to try it. I now see it is actually not the good way to proceed. The FAQ is pretty clear, it has been very helpful. Thank you very much for your help.

joey711 commented 8 years ago

Glad it helped.

Couple points:

DESeq2 is a large and complex R package, not a data transformation.
DESeq2::vst --> Bray-Curtis metric --> Something else ... might be fine for some datasets. I didn't say anything about its general appropriateness (or lack thereof).

Just to make sure it is also described here, and not just at that link, negative numbers after log-like transformation like vst represent "less than one count" after rescaling. Hardly @werdnus 's "throwing away the baby with the bath" for one to truncate these values, assign them a floor, etc. There are a couple of important points here.

Robustness. Your (less-formal) inferences during exploratory approaches like ordination should be robust to these tiniest of observed counts, even in their ensemble/aggregate. If not, then you may want to be more careful about the patterns you thought you were seeing.
Denoising. Nearly all of the common OTU methods are terrible w.r.t. false positives -- OTUs that are actually just PCR/sequencer artifacts -- with UPARSE being the best among them by a long ways, thought not open source. A large fraction of the OTUs/features/taxa that posses these tiny values will disappear after an effective denoising.

See this comparison of DADA2 v. 'State of the Art' OTU methods for a useful benchmark on mock and simulated data.

Also see the DADA2 home page for details and instructions on how to use this. We've found that it generally gives impressive results w.r.t. false positives, in addition to providing exact sequence resolution (no OTU clustering needed). The ~negligible false features means many fewer of these pesky fake rare taxa to worry about and muck up downstream methods, and likely a more reliable parametric transformation like DESeq2::vst, etc.

I would have mentioned sooner, but the package was still in progress and still undergoing validation tests, etc. The paper is coming out soon, now, though; and the package has been released on the latest version of Bioconductor:

https://www.bioconductor.org/packages/release/bioc/html/dada2.html

Hope that helps.

Best

joey

werdnus commented 8 years ago

Ah, I suppose I should have left a comment earlier. Left to my own devices with no one commenting on this thread, I did some exploration (tested out the MetaGenomeSeq package's CSS transformation, among other things) and discovered that all of the negatives in my transformed set were actually zeros in the original count data. So yeah, not throwing the baby out with the bath, indeed.

Thanks for getting that FAQ up Joey—it'll save a lot of people the same sort of rough learning curve I went through, I think.

joey711 commented 8 years ago

@werdnus sorry for my long delay, as well, and for your trouble. I'm glad you found some alternatives and did some sleuthing. Hope that data looks good!

MTutino commented 8 years ago

Hi Joey,

I have a very stringent pipeline for denoising which uses Vsearch for chimera detection and OTU clustering. I already went through DADA2 and I will test it very soon. So I am very happy with that.

Now I am trying to understand the best way to visualise my data. I do not think the bray-curtis applied on the DESeq2::vst or DESeq2::rlog microbiome transformed data is the best way to proceed. It would require me to modify the data to avoid the production of negative numbers (like adding an arbitrary constant to each value before DESeq analysis or setting every negative number as zero after the analysis). I would prefer not to play with the data this way.

I thought it was a good method transforming the data to take into consideration the depth before calculate the distance/dissimilarity metrics and then visualise the data. What do you think about it? I am not a statistician and I am not very sure what metrics to use on the transformed data.

Thanks.

spholmes commented 8 years ago

I'll jump in here.

Different questions require different transformations; one size does not fit all. For differential abundance testing, there is no doubt that DESeq2 is the way to go. However the structure and qualities of your data dictate the visualization step.

I agree that Bray Curtis is not the best way to calculate relevant distances after transformation of the data. We usually use just PCA on the transformed data but not because of the negative numbers just because DESeq makes a transformation that you can think of a log in the large scale and making ratios of logs doesn't make sense. If you already took care of the multiplicative effects your don't need to do it again with the ratios as used in Bray Curtis.

For ordination purposes, you can keep the data as counts and use correspondence analysis which takes the depth and ratios into account.

If you want to take the tree into account in your analyses we have been using DPCOA with a lot of success. It really depends on how different your sample depths are.

Hope this helps, Susan

On Wed, May 11, 2016 at 3:00 AM, MTutino notifications@github.com wrote:

Hi Joey,

I have a very stringent pipeline for denoising which uses Vsearch for chimera detection and OTU clustering. I already went through DADA2 and I will test it very soon. So I am very happy with that.

Now I am trying to understand the best way to visualise my data. I do not think the bray-curtis applied on the DESeq2::vst or DESeq2::rld microbiome transformed data is the best way to proceed. It would require me to modify the data to avoid the production of negative numbers (like adding an arbitrary constant to each value before DESeq analysis or setting every negative number as zero after the analysis). I would prefer not to play with the data this way.

I thought it was a good method transforming the data to take into consideration the depth before calculate the distance/dissimilarity metrics and then visualise the data. What do you think about it? I am not a statistician and I am not very sure what metrics to use on the transformed data.

Thanks.

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/joey711/phyloseq/issues/492#issuecomment-218415438

Susan Holmes Professor, Statistics and BioX John Henry Samter Fellow in Undergraduate Education Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/

joey711 / phyloseq

DESeq variance stabilization and community distance matrices #492