joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
579 stars 188 forks source link

Support transformation of OTU counts according to the copies of 16S gene in a given species #636

Closed wolass closed 7 years ago

wolass commented 8 years ago

The problem with NGS data is that there might be multiple copies of a selected gene (in this case S16) which vary in numbers in various species.

If one would use a set of primers the results of their amplification might be sometimes 5 times more than the actual number of genome equivalents in a given sample.

So for example If you did NGS of bacteria from skin there will be S. aureus but the number of bacteria is 5 times lower than the number of read OTUs by MiSeq. But for P. acnes the number of copies will be 3.

There should be a way to account for this difference. Could we implement such a feature that one would give a database of known copies of a gene (optimally for a given primer pair used) per species? this would greatly increase the results.

One approach is given here: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002743

(with R code!!)

I hope you like the idea. Best. W

spholmes commented 8 years ago

W, This has been on my mind some time and I am glad you bring it up, one of the things we'd like to do is have a faithful copy number variation database we can use. To be honest with the new DADA2 package we are already suffering from referencing the RSV (variants), if we add that for those variants we need to know the copy numbers to make valid inferences we are actually quite far from having accurate data. I know the paper and respect the authors you cite and will look for the data that would enable us to implement this. Thanks for the suggestion. Susan

On Fri, Jul 8, 2016 at 3:50 AM, Wojciech Francuzik <notifications@github.com

wrote:

The problem with NGS data is that there might be multiple copies of a selected gene (in this case S16) which vary in numbers in various species.

If one would use a set of primers the results of their amplification might be sometimes 5 times more than the actual number of genome equivalents in a given sample.

So for example If you did NGS of bacteria from skin there will be S. aureus but the number of bacteria is 5 times lower than the number of read OTUs by MiSeq. But for P. acnes the number of copies will be 3.

There should be a way to account for this difference. Could we implement such a feature that one would give a database of known copies of a gene (optimally for a given primer pair used) per species? this would greatly increase the results.

One approach is given here:

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002743

(with R code!!)

I hope you like the idea. Best. W

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/joey711/phyloseq/issues/636, or mute the thread https://github.com/notifications/unsubscribe/ABJcvfVUkWv5w1qFY2aLSLRZHOiRQmj_ks5qTiuAgaJpZM4JH7I3 .

Susan Holmes Professor, Statistics and BioX John Henry Samter Fellow in Undergraduate Education Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/

joey711 commented 8 years ago

I agree with supporting this, though I think the concern is a little bit exaggerated.

While the DADA2 results encounter this much more often than 97% OTUs, the worst-case scenario is that we have variables that are highly correlated because they always appear within the same genome of the same cell. However, these variables will have sequences that are very similar, so anything that utilizes taxonomy or tree would resolve much/all of this automatically. It would be exceedingly rare to have intragenomic 16S rRNA variants that are more than a handful of bases different across their whole sequence.

The main special case I see for this seems to be a correction for richness / alpha-diversity estimates, where we'd slightly overestimate richness by some idiosyncratic factor according to the weighted average intragenomic-variant per strain. For most applications I can live with this already, but an effective correction would still be nice.

Again, a lot of the other types of inference we're doing must already address high correlations among closely related taxa because this is expected, for instance, among many strains of the same species behaving very similarly.

wolass commented 8 years ago

However, these variables will have sequences that are very similar, so anything that utilizes taxonomy or tree would resolve much/all of this automatically. It would be exceedingly rare to have intragenomic 16S rRNA variants that are more than a handful of bases different across their whole sequence.

I'm not sure we are on the same page here. You are talking about different gene variants in strains and I am talking about the repeated number of the 16S genes in one bacterial cell which is strain/species specific.

So I think this has a great impact on the resulting abundance calculation, beacuse if we then relatively compare bacteria which have polarizing 16S gene number in their genome we might come up with results that are far from truth.

Consider the following example:

We did NGS and came up with the folowing abundances based on the READS: OTU1 = 5000 reads OTU2 = 1000 reads

We conclude that there were 5 times more bacterial CELLS of OTU1 then OTU2 in our sample.

THE ABOVE STATEMENT IS FALSE, because if we would account for how many times the 16S gene is present in the OTU1 genome (lets say OTU1 has 5 copies of the 16S in its genome and OTU2 has only one copy per genome) we would come up with the conclusion that in fact there was an equal number of cells (or better yet "genome equivalents") in our sample.

This changes every relative abundance comparison between bacterial species...

joey711 commented 8 years ago

yes, of course, the copy number difference is well documented. As is PCR bias, extraction bias, etc.

However, if you're trying to use 16S amplicon data to estimate the number of cells without a standardized reference dataset that would account for other structural bias (PCR, extraction, etc), then you're on a soft foundation. This also assumes that there's something meaningful about the comparison of amplicon-based relative abundances with a precision < 10-fold or so. I can't think of many studies that have done this convincingly. Usually it is the change across different specimens that is most biologically informative. Perhaps you have a special case, like a carefully modeled system for which the precise ratio of genomes is meaningful.

If you can point to a nicely formatted reference DB that has counted the number of 16S copies in each genome, then I can demonstrate some code that would do this transformation from "amplicon reads" to "genome equivalents".

In short, this is a simple transformation that is already possible to do with phyloseq. Suppose you have 100 reads assigned to Genome A, and you know that Genome A has 5 copies (whether or not they're identical doesn't matter as long as you've assigned them correctly to Genome A). Then you have 20 "genome equivalents"... Note that for comparing abundances between taxa in the same sample, you'll want to be able to perform the "genome equivalents" transformation for all, as a partially transformed sample is not so easy to interpret, but a "relative abundance of genome equivalents" is pretty much business as usual (but a little better for certain cases).

Also, importantly, in many cases you're comparing changes in abundance of a particular 16S sequence across samples. In this setting, the genome copy number transformation is not necessary, as it is an invariant factor applied to every observation. Furthermore, doing this transformation ahead of differential abundance can skew the differential abundance statistical inference if you're using something sophisticated (like DESeq2) that models the sampling process with a Poisson mixture.