Closed joey711 closed 10 years ago
Done. This was added in version 1.7.3
:
https://github.com/joey711/phyloseq/pull/253
This new function,
phyloseq_to_deseq2()
facilitates our recommended approach to using the Negative Binomial to model microbiome read count data, as a general alternative to sample proportions or rarefying, as we recently described in a draft article on the arxiv:
Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible
This has been submitted for peer-review, along with four supplemental tutorials/simulations. We will host update-able versions of those among the phyloseq tutorials once the article has been accepted.
Joey-
I just read your ArXiv draft - very nice work. A few clarification questions about rarefying and it's implications.
My understanding of your work is that rarefying to an even sampling depth without replacement can be justified in some cases when you are comparing your samples using simple relative abundance in order to do clustering, ordination or multivariate analyses (e.g., non-parametric analysis of dissimilarities). As long as your sample sizes are large enough (effect sizes) after rarefying that you still capture the relevant clustering patterns - you should be good to go.
However, if you want to really test for differential abundance then you need to use your raw count data in a negative binomial model such as those implemented in DESeq or BaySeq. The raw count data is not rarefied nor normalized in any way prior to input into those programs because the negative bionomial models the count data appropriately to handle the uneven sampling depth.
Would you agree that those are two of the major conclusions of this work?
By the way, is there a reason you used the negative binomial as implemented in DeSeq as opposed to the negative binomial implemented in BaySeq?
Thanks, Kristina
Joey-
Another follow-up set of questions:
In your paper you suggest that repeated rarefying trials are better than a single rarefy. Is there a way to do this in the phyloseq package? I see that the rarefy_even_depth function can be used to rarefy - but how do you combine the repeated trials?
As an alternative to repeated rarefying you suggest transforming each count value to a common-scale using the minimum library size across your samples. For example, you have 1000 counts for OTU1 in sample A. There are three samples, A, B and C with library sizes of 80,000, 25,000 and 150,000, respectively. To transform the sample counts for OTU1 you would do the following calculation: 1000(25,000/80,000) = 313 (rounded). Is that correct?
Thanks, Kristina
Kristina,
Thanks for the feedback. I'm addressing reviewer comments on this manuscript, and will try to fully respond to your two posts above as soon as I can. Basically, the implementation details will have to wait, and we are providing a lot of that with the final manuscript when it ultimately gets published.
I want to correct some points in your summary from the first post, because I don't want lingering misunderstanding about the conclusions of our manuscript (perhaps we were stating things to cautiously).
Hope my answers help to clarify. Thank you very much for the feedback. It really helps a lot. Being able to consider your comments before we're finished making revisions for final publication is the main reason we wanted to post on the arxiv ahead of submission. It was my first time trying that, and I think it convinced me that it is was worthwhile. I will go back and try to re-read our latest draft with your comments in mind, and try to find where I can make some of these things more clear.
Thanks again for all your other feedback on the package, and best of luck with your research,
joey
Joey-
Thank you very much for your detailed comments and corrections of my interpretations. I know that it can be difficult to be diplomatic yet firm about conclusions in manuscripts especially when microbiome researchers are wearing rose-colored glasses as they read the manuscript!
It is indeed difficult to move away from rarefying as a standard procedure in these types of studies. At the moment, I find myself at that exact juncture where I need to decide how to model the count data when presenting comparisons of abundance across samples. Using BaySeq is quite easy to do and is able to detect differential abundance of OTUs across samples even for OTUs that are less than 0.1% of total reads in the dataset! (Funny that you used that exact number). It of course picks up the OTUs that are 5% or more of the dataset as well, but in my study I am interested in the differential abundance of those rare taxa as well.
One way the manuscript might be improved for an audience of microbiome researchers looking to follow your recommendations is to have a section in the discussion or Implications section where you explicitly state the steps they ought to take (akin to the way you described rarefying in the Introduction) starting from raw (untransformed, non-normalized OTU by count tables). Figure 2 is useful in understanding your simulations but that explicit enough to follow when trying to figure out how to proceed with your own data.
It may be that you have included that information in the supplementary sections, which I was not able to access on ArXiv. So, take the above suggestion with a grain of salt.
For my own part, I would be very interested in reading your tutorial about the DEseq implementation of modeling the count data. I tried finding it on Github and on your phyloseq tutorial pages but was only able to find a few sections of code describing it, though not in the easy to read R markdown format including the figures.
Thanks for your contributions towards improving statistical techniques in microbiome research!
Kristina
Joey-
Just a quick note to say that I was able to access the supplemental material on ArXiv and I'm going through it now. Would still love to see the DEseq tutorial though!
Kristina
FYI, the supplemental material on the arxiv includes only the mathematical supplement, but none of our Rmd tutorials. We will release Rmd tutorials when the manuscript is accepted for publication in its final form. One exception is the DESeq2 tutorial that is already included as a vignette in the latest version of phyloseq on GitHub and BioC-devel.
Thanks for the additional feedback on the manuscript. We are definitely taking your comments into consideration.
joey
Joey-
I have a newbie R question- I can't access the vignette.
vignette("phyloseq-mixture-models", package="phyloseq") Warning message: vignette ‘phyloseq-mixture-models’ has no PDF/HTML
What's the rookie mistake here?
Kristina
Actually this is a good question. I just noticed that the github-installed version does not have the vignettes built in the package. I imagine this is a developer-oriented design choice in the devtools::install_github
function to speed up (re-)install time. I happen to have the BioC-devel version up to date, so you can download and re-install from there, and the vignettes will be pre-built. Not that there is on difference in the source code between the two, but the BioC servers provide pre-built "binary" versions for Mac OS and Windows. You could "build" the github version and get the same results, but it is more complicated (hint: you would use R CMD build phyloseq
from the command line after downloading/unpacking the phyloseq repository)
Also, make sure you are using version 1.7.10+
.
See the devel page for phyloseq on BioC: http://www.bioconductor.org/packages/2.14/bioc/html/phyloseq.html
However, I noticed that their daily-build system cleared version 1.7.10
as being built okay on Windows and Mac OS:
http://www.bioconductor.org/checkResults/devel/bioc-LATEST/phyloseq/moscato2-buildsrc.html
but they haven't yet provided the pre-built binaries for 1.7.10
. So, only 1.7.9
is available in the links at the bottom. However, the vignettes are up to date in that version, and they also provide links to the vignettes on that page.
So short answer: view the vignette at the following link http://www.bioconductor.org/packages/2.14/bioc/vignettes/phyloseq/inst/doc/phyloseq-mixture-models.html
The binaries are now version 1.7.10
and should include the pre-built vignettes. You can, however, still use the links above to see the vignettes if you'd like. Either way, they are available (and the source code Rmd for the vignettes was always available).
I'll close this issue for now unless/until something comes up.
Thanks again for the feedback!
joey
Hi Joey, I'm very interest in this debate about the rarefying problem. Recently I have found another work on this subject titled "Effects of library size variance, sparsity, and compositionality on the analysis of microbiome data" and at least thay conclude saying "Rarefying yields fewer OTUs as significantly differentially abundant, but those OTUs are robust, in the sense that thay are almost always identified as significant by at least one other differencial abundance detection model"
Thus i'm bit confuse about which is the right test to use on microbiome case, could you give me any comments about this paper or well this phrase? thank you for the help
Andrea I think that rarefying was used to overcome two problems, qiime and mothur are oversensitive to noise because there is no denoising component so rarefying can decrease this noise by looking at the data at a lower resolution (from afar so to speak).
Now that we are using a denoised pipeline (dada2) this does not make any sense, so one would not use rarefying any more if you are denoising.
The other reason to rarefy is to normalize the data and there are more powerful ways of doing this using transformations of the data whether through DEseq2 or ranking or other log type transformations. So transformations and more powerful procedures should be used for differential abundance testing;
however if you just want to look at unifrac distances you should still rarefy.
Hope this helps, Susan
On Thu, Feb 9, 2017 at 1:41 AM, AndreaQ7 notifications@github.com wrote:
Hi Joey, I'm very interest in this debate about the rarefying problem. Recently I have found another work on this subject titled "Effects of library size variance, sparsity, and compositionality on the analysis of microbiome data" and at least thay conclude saying "Rarefying yields fewer OTUs as significantly differentially abundant, but those OTUs are robust, in the sense that thay are almost always identified as significant by at least one other differencial abundance detection model"
Thus i'm bit confuse about which is the right test to use on microbiome case, could you give me any comments about this paper or well this phrase? thank you for the help
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/joey711/phyloseq/issues/229#issuecomment-278592854, or mute the thread https://github.com/notifications/unsubscribe-auth/ABJcvRmy7qgdNU719Mkq1PejkWMqL1oSks5rat86gaJpZM4A24E9 .
-- Susan Holmes Professor, Statistics and BioX John Henry Samter Fellow in Undergraduate Education Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/
Thank you a lot!! Now this is much more clear to me. Still one question about this assumption: If the most powerful way to normalize data is DESeq2, so why use rarefying for the beta-diversity? I mean, why do not use the normalized dataset by DESeq2 to investigate even at the beta diversity? thank you a lot for your suggestions Susan, this helped me a lot in understanding
It's very hard to use unifrac and diversity indicies at unequal depths one has the alternative though of using DPCoA and other affiliated methods that take the tree `into account'.
On Fri, Feb 10, 2017 at 1:23 AM, AndreaQ7 notifications@github.com wrote:
Thank you a lot!! Now this is much more clear to me. Still one question about this assumption: If the most powerful way to normalize data is DESeq2, so why use rarefying for the beta-diversity? I mean, why do not use the normalized dataset by DESeq2 to investigate even at the beta diversity? thank you a lot for your suggestions Susan, this helped me a lot in understanding
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/joey711/phyloseq/issues/229#issuecomment-278896065, or mute the thread https://github.com/notifications/unsubscribe-auth/ABJcvXWBTYawdFPInw1K11PnONE21LRiks5rbCyNgaJpZM4A24E9 .
-- Susan Holmes Professor, Statistics and BioX John Henry Samter Fellow in Undergraduate Education Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/
Leaving it vague for now. Big updates will happen soon.
joey