How does estimate_richness deal with uneven sequencing depth?

laurenms commented 10 years ago

Hi, I apologize in advance if this is answered somewhere in the documentation that I haven't been able to find yet. How does estimate_richness deal with uneven sequencing depth? I was under the impression the function wanted raw counts as the input but should the data be transformed already (not necessarily rarefied but something like that)? Thanks! lauren

joey711 commented 10 years ago

@laurenms

You do not want to transform or filter your data before estimating richness, other than quality assurance filtering that would remove non-target sequences. Usually that sort of filtering would be done on the sequence data before it is in the form of a contingency table (table of counts). Basically, the very first table of counts that you have in your workflow is probably the one that you want to use for estimate_richness and plot_richness.

See the following tutorial for examples...

http://joey711.github.io/phyloseq/plot_richness-examples.html

laurenms commented 10 years ago

Thanks Joey! Just to clarify, the estimate_richness function does not account for sequencing depth? Then wouldn't samples with more sequence reads automatically end up with higher observed richness? lauren

joey711 commented 10 years ago

Hi @laurenms

It's ambiguous what you mean by "account". There are several different alpha diversity estimators supported in the estimate_richness function, and they all incorporate library size into their estimates in different ways, except for the "Observed" option, which simply reports the number of OTUs that were observed at least once in each sample. I think this is the one you were actually asking about.

To be clear, the other methods require that you use raw counts. Do not use rarefied counts for the Chao1 estimate or the Shannon index, for example. The "Observed" option is not a method at all; it just shows you what is in the data as you provided it. If you transform your data, especially if you rarefy, and then want to estimate richness, the "Observed" result is the only one still available to you. Rarefying reduces the precision with which you would estimate the diversity in the first place, so it generally shouldn't be done. I suspect, however, that this is the workflow you had in mind.

All alpha-diversity indices/estimates are aware of differences in sample size (library size, or number of reads in this case), because this has always been a problem when attempting to count things, even trees in a forest (see Sanders' original paper describing rarefaction, a different technique than rarefying, if rarefying can be called a technique).
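
To make the point about raw counts concrete, here is a minimal, language-neutral sketch in Python. It is not phyloseq's or vegan's code. It assumes the bias-corrected form of Chao1, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of OTUs seen exactly once and exactly twice; the count vector is made up for illustration. Because the estimate is built from singletons and doubletons, a table converted to proportions has neither, and the estimate collapses to the observed count.

```python
def observed_richness(counts):
    """Number of OTUs seen at least once in a sample."""
    return sum(1 for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 (assumed form): S_obs + F1*(F1-1) / (2*(F2+1))."""
    s_obs = observed_richness(counts)
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical raw counts for one sample (one OTU absent).
raw = [120, 45, 9, 2, 2, 1, 1, 1, 0]

# The same sample converted to proportions, a common transformation.
total = sum(raw)
proportions = [c / total for c in raw]

print(observed_richness(raw))   # 8
print(chao1(raw))               # 8 + 3*2 / (2*3) = 9.0
print(chao1(proportions))       # 8.0: no singletons or doubletons remain
```

The last line shows why a transformed table leaves only "Observed": the information the estimator relies on (how many OTUs were seen once or twice) is gone.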

A point of procedure:

If I've delayed responding to an issue, please do not open a new issue to get my attention. You can just as well get my attention by posting another comment on the same issue, and that way things stay organized.

e.g. https://github.com/joey711/phyloseq/issues/289

My slowness can't be helped, but reminders are always welcome O:-)

laurenms commented 10 years ago

Thanks Joey! Sorry about the duplicate post. I wasn't trying to rush you; I was worried you might not see the post after the issue was closed. I'm sorry to have asked an ambiguous question. I think I was more confused about how the Chao1 and Shannon index calculations deal with library size/number of reads, which might be a question for a stats book instead of bothering you.

joey711 commented 10 years ago

No worries!

Yes, Chao1 and Shannon are long-standing and well-documented methods. The Shannon index is, well, an index, rather than an estimate of the number of species. An important difference from Chao1 is that it also incorporates the distribution of species (OTUs) in its valuation.
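
As a small illustration of that last point, the sketch below (Python, standard library only, with made-up counts) computes the Shannon index as -sum(p * log(p)) using the natural log. Two hypothetical samples have the same number of OTUs and the same number of reads but very different indices, because the index also reflects how evenly the reads are spread across OTUs.

```python
import math


def shannon(counts):
    """Shannon index -sum(p * log(p)) over OTUs with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Two hypothetical samples: 4 OTUs and 100 reads each.
even = [25, 25, 25, 25]    # reads spread evenly
skewed = [97, 1, 1, 1]     # one OTU dominates

print(round(shannon(even), 3))    # ln(4), about 1.386
print(round(shannon(skewed), 3))  # much lower, despite the same richness
```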

laurenms commented 10 years ago

Thank you!!

sargdavid commented 4 years ago

Hi Paul,

I know this is a closed issue, so I can open a new one if you prefer, but I still have the same question that @laurenms had: how do you take sequencing depth into account? I am looking at your function, and here is what it looks like for, e.g., Shannon's index:

    if ("Shannon" %in% measures) {
        outlist <- c(outlist, list(shannon = diversity(OTU, index = "shannon")))
    }

i.e., you are using vegan::diversity, correct? If so, that function does not seem to have any adjustment for sequencing depth, as it simply calculates -sum(p * log(p)), where p is the proportion of reads assigned to each OTU in the sample.

As @laurenms noted in the very beginning, samples with a larger number of reads will have a better chance of detecting more OTUs and hence will end up with a higher Shannon index. Am I missing something?

Thank you.
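
The concern raised here can be explored with a small simulation. The sketch below is Python with the standard library only; it is not phyloseq or vegan code, and the community composition, depths, and helper names (`draw_sample`, `observed`) are all hypothetical. It shows two things under those assumptions: the plug-in index -sum(p * log(p)) depends only on proportions, so scaling all counts by the same factor changes nothing; but drawing fewer reads from the same community misses rare OTUs, so both observed richness and the plug-in Shannon value tend to be lower at shallow depth and level off as depth grows.

```python
import math
import random
from collections import Counter


def shannon(counts):
    """Plug-in Shannon index -sum(p * log(p)) over nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def observed(counts):
    """Number of OTUs seen at least once."""
    return sum(1 for c in counts if c > 0)


def draw_sample(weights, depth, rng):
    """Draw `depth` reads from a community; return per-OTU counts."""
    reads = rng.choices(range(len(weights)), weights=weights, k=depth)
    tally = Counter(reads)
    return [tally.get(i, 0) for i in range(len(weights))]


# Hypothetical community: 10 abundant OTUs and 190 rare ones.
community = [50] * 10 + [1] * 190

# Same proportions at two library sizes give the same index.
print(shannon([10, 20, 30]), shannon([100, 200, 300]))

# Same community sequenced at increasing (made-up) depths.
rng = random.Random(0)
for depth in (30, 300, 3000, 100000):
    counts = draw_sample(community, depth, rng)
    print(depth, observed(counts), round(shannon(counts), 3))
```

Whether this matters for a real dataset depends on how close each sample is to its plateau, which the simulation cannot tell you.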

samd1993 commented 4 years ago

I would be worried about this too, since my results hinge on getting a significant result for richness, but the sequencing depth isn't even. Does the log help mitigate sequencing-depth issues, since that is itself a transformation?

RomGallet commented 1 year ago

Dear Joey, I have the same question regarding the relationship between sequencing depth and diversity indices. I'm working on a gut microbiome dataset in which I have thousands of samples. Of course, their sequencing depths differ, but the rarefaction curves seem to indicate that I reach the plateau in all cases, and thus that I should not have any issues in detecting diversity. However, when I plot "number of reads" vs. "Shannon index", I find a significant correlation between these factors. I'm not sure how I can solve this problem. Do you have any idea? Thank you in advance.
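
The check described above (number of reads against Shannon index) can be sketched as follows. This is an illustrative Python example with made-up count vectors and a hand-written Pearson correlation, not a phyloseq workflow; the sample names and the `pearson` helper are hypothetical. A nonzero correlation on its own does not say whether the relationship is a sampling artifact or a real biological difference between samples.

```python
import math


def shannon(counts):
    """Plug-in Shannon index -sum(p * log(p)) over nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical per-sample OTU counts (made-up numbers).
samples = {
    "s1": [10, 5, 1, 0, 0],
    "s2": [100, 40, 12, 3, 0],
    "s3": [1000, 420, 90, 35, 4],
}

depths = [sum(c) for c in samples.values()]        # library size per sample
indices = [shannon(c) for c in samples.values()]   # Shannon index per sample
r = pearson(depths, indices)
print(round(r, 3))
```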

joey711 / phyloseq

How does estimate_richness deal with uneven sequencing depth? #287