joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/

Waste not, want not paper questions #282

Closed kfontanez closed 10 years ago

kfontanez commented 10 years ago

I may be jumping the gun on this one a little bit and these questions may be answerable once you release the vignettes and Rmarkdowns associated with your "Waste not, want not" paper. If so, I look forward to taking a look at them!

After reading the latest version of the paper (version 2) and looking at the vignette in the latest phyloseq development version, which uses the phyloseq_to_deseq2 and DESeq functions, I have two questions.

  1. One of the suggestions in your paper is to use nbinomWaldTest to estimate differential abundance with the "default" options in DESeq2. It isn't clear to me from your paper whether your analyses used independent filtering (set to TRUE by default in DESeq2) or Cook's outlier detection.
  2. DESeq2 has two available methods for count normalization, the variance stabilizing transformation (vst) and the regularized log transformation, the latter of which is preferred for samples with widely varying size factors. In your study, did you test both of these transformations and, if so, do you have a sense of the size factors at which the regularized log transformation becomes preferable to the vst?
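
For readers who want to check how widely size factors vary in their own data before choosing a transformation, here is a minimal sketch in R. It assumes DESeq2 is installed and uses simulated counts from DESeq2's makeExampleDESeqDataSet in place of a real phyloseq object; the simulated size factors in true_sf are made up for illustration. It does not establish any cutoff at which one transformation becomes preferable.

```r
library(DESeq2)

set.seed(1)

# Simulated counts (assumption: stand-in for real data).
# 1000 features, 12 samples, true size factors spanning a 30-fold range.
true_sf <- rep(c(0.1, 0.3, 1, 3), times = 3)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, sizeFactors = true_sf)

# Median-of-ratios size factors as estimated by DESeq2.
dds <- estimateSizeFactors(dds)
sf <- sizeFactors(dds)

# Ratio of largest to smallest estimated size factor.
sf_spread <- max(sf) / min(sf)
print(round(sf, 2))
print(sf_spread)

# Both transformations, blind to the experimental design,
# so that they can be compared side by side.
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)
rld <- rlog(dds, blind = TRUE)
```

With a real dataset, the DESeqDataSet would come from phyloseq_to_deseq2 instead of the simulation step.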

Thanks for your insights!

Kristina

joey711 commented 10 years ago

Kristina,

These are good questions. I will try to address them soon.

joey

joey711 commented 10 years ago

Kristina,

The article has been accepted in PLoS Computational Biology and we are currently working with the production staff. The snapshot of source files for the simulations and outputs will be published as a compressed supplemental file accompanying the article. I will likely also post a separate GitHub repo with the files and tutorials... and this will most likely be associated with the arXiv version of the article, which we may update with a few extra results that did not make it into the upcoming article.

Question 1 -- I will need to go back and check, but I expect that I used the default independent filtering and not Cook's outlier detection. Now that you mention it, I would like to try both, so I will add this to the to-do list for the GitHub repo version.

Question 2 -- Again, I will need to go back and check, but I expect that I tried only vst. Another item for the to-do list.

Figures for the publication need to be as clear and concise as possible, but I can add many additional comparisons/figures to the GitHub tutorials. The lingering issue motivating your questions, though, is whether one option is better than the other for your data. I plan to look into it for various datasets so that I can say something useful about it in the tutorials.

kfontanez commented 10 years ago

Joey-

Congratulations on the acceptance of your article.

Since I posted this question, I ended up choosing not to use either independent filtering or Cook's outlier detection for my metagenomic data.

Because the treatments I am using result in vastly different microbial communities, using Cook's outlier detection resulted in the loss of some of my most differentially abundant taxa. The outlier detection makes sense for transcriptomics, where you don't expect too many changes from one treatment to the next, but for metagenomics, where you might expect large swings in taxonomic diversity, it just doesn't make sense to me.

The independent filtering seemed like a good idea in principle, but I found it easier to do by hand in my Excel spreadsheets after the fact. That way, I could see exactly which taxa I was losing at each cutoff.
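
As a rough illustration of the choices described above, the following R sketch turns off both independent filtering and the Cook's distance cutoff in DESeq2's results() call, then applies a filter by hand so that the dropped features stay visible. It assumes DESeq2 is installed and uses simulated counts; the cutoff min_base_mean is a hypothetical value chosen only for the example, not one taken from this thread.

```r
library(DESeq2)

set.seed(1)

# Simulated counts with real differences between two conditions
# (assumption: stand-in for a DESeqDataSet built from real data).
dds <- makeExampleDESeqDataSet(n = 500, m = 12, betaSD = 1)

# DESeq() runs size factor estimation, dispersion estimation,
# and the Wald test (its default test).
dds <- DESeq(dds)

# Default results: independent filtering and Cook's cutoff both active.
res_default <- results(dds)

# Results with both switched off.
res_unfiltered <- results(dds,
                          independentFiltering = FALSE,
                          cooksCutoff = FALSE)

# Manual filter on mean normalized count, keeping a record of what is lost.
min_base_mean <- 10  # hypothetical cutoff, for illustration only
res_df <- as.data.frame(res_unfiltered)
kept <- res_df[res_df$baseMean >= min_base_mean, ]
dropped <- rownames(res_df)[res_df$baseMean < min_base_mean]

# Features whose adjusted p-value is NA under each setting.
print(sum(is.na(res_default$padj)))
print(sum(is.na(res_unfiltered$padj)))
print(length(dropped))
```

Note that the padj values in kept were adjusted across all tested features, because the manual filter is applied after the multiple-testing correction.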

As for the variance-stabilizing transformation, I actually settled on the regularized log (rlog) transformation instead. In part, this decision was motivated by discussions I had on the Bioconductor list with the authors of DESeq regarding the various merits of the transformations. It also made sense to use the rlog transformation because it is very similar to the transformation used during nbinomWaldTest, which is used to test for differential abundance.

My final approach consisted of choosing a study design in DESeq2, running nbinomWaldTest with that design to test for differential abundance among my samples, exporting rlog-transformed count data that took the chosen design into account, and using that data to make relative abundance heatmaps/bar charts in phyloseq. I also exported the log2 fold change results from nbinomWaldTest and was able to make some gorgeous heatmaps showing log fold changes among my samples in phyloseq.
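
The approach described above could be sketched roughly as follows. This is one possible reading of it, not the code that was actually used: it assumes DESeq2 and phyloseq are installed, uses simulated counts with a single two-level factor named condition, and assumes that "took into account the chosen study design" corresponds to rlog(blind = FALSE). With real data, the DESeqDataSet would come from phyloseq_to_deseq2 with the chosen design formula.

```r
library(DESeq2)
library(phyloseq)

set.seed(1)

# Simulated counts (assumption: stand-in for phyloseq_to_deseq2 output).
# The design is ~ condition, a two-level factor.
dds <- makeExampleDESeqDataSet(n = 300, m = 12, betaSD = 1)

# Wald test for differential abundance under that design.
dds <- DESeq(dds)
res <- results(dds, independentFiltering = FALSE, cooksCutoff = FALSE)

# rlog transformation that uses the design (blind = FALSE).
rld <- rlog(dds, blind = FALSE)
rlog_mat <- assay(rld)

# Log2 fold changes per feature, ready to export or plot.
lfc <- data.frame(taxon = rownames(res),
                  log2FoldChange = res$log2FoldChange)

# Wrap the rlog values and sample metadata in a phyloseq object
# so that phyloseq plotting functions can be applied to them.
ps_rlog <- phyloseq(
  otu_table(rlog_mat, taxa_are_rows = TRUE),
  sample_data(as.data.frame(colData(dds)))
)
print(ps_rlog)
```

rlog values can be negative, so plots that assume non-negative abundances may need the values shifted or truncated first.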

So, when testing the various transformations I would suggest that you include example datasets with large swings in taxonomic diversity as well as those with more subtle changes.

Kristina

joey711 commented 10 years ago

Great! So this issue is now closed.