joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
584 stars 187 forks source link

Cutoff for filtering #1115

Open LenaLapidot opened 5 years ago

LenaLapidot commented 5 years ago

Hello, I have to plot a histogram of the relative abundance of the different ASVs, and based on this suggest a cutoff for removing low abundance ASVs. How can I plot the relative abundance of ASVs across all samples? I have a very large phyloseq file.. Thank you, Lena

mikemc commented 5 years ago

If all you want is a histogram of the mean proportions than you can do

ps.prop <- transform_sample_counts(ps, function(x) x / sum(x))
asv_means  <- taxa_sums(ps.prop) / nsamples(ps.prop)
qplot(asv_means, log = "x") +
    geom_vline(xintercept = ???) # Replace ??? with a hypothetical cutoff

But you might also want to look at other properties such as the prevalence and maximum frequency. E.g., you might want to keep ASVs that are rare overall but reach high frequency in 1 or a couple samples.

LenaLapidot commented 5 years ago

Thank you Michael. I am trying to find out what cutoff I should use. That's why I'm trying to plot the relative abundance of the ASVs, but there are many, so my histogram doesn't provide the information that I'm looking for.

How would you suggest to determine the cutoff for filtering low abundant ASVs?

mikemc commented 5 years ago

It depends both on your experimental design and your analysis goals. Why are you filtering? Is it because you think very rare things may be false positives or contaminants? And do you think rare things may be playing a biological role in your system? Often people will filter very low prevalence ASVs before doing community-wide differential abundance testing (e.g. with DESeq2 or ALDEx2) as you will have no power to measure differences in these taxa and this will reduce the number of comparisons. Or simply to be able to make a heatmap or tree plot that is more readable. Unfortunately there is no universal right answer. You might also see the section on "Prevalence filtering" at https://f1000research.com/articles/5-1492/v2

Whatever filter you choose (and you might experiment with a few), a good sanity check is to see how many ASVs and what fraction of the reads you keep, and perhaps make sure you aren't throwing out too many reads from individual samples. If ps is your original phyloseq object and ps0 is the phyloseq object after filtering with Abundance as read counts, then you can use print(ps0) or ntaxa(ps0) to check how many ASVs were filtered and something like

frac_reads_kept <- sample_sums(ps0) / sample_sums(ps)
summary(frac_reads_kept)

to check the overall and max number of reads thrown out per sample.

LenaLapidot commented 5 years ago

Thank you for the prompt reply. In this setting, I'm not interested in the rare ASVs.. So I'm trying to plot the relative abundance of the ASVs and determine a cutoff. According to the code you've sent, I get this histogram. Does it make sense to determine the cutoff to 0.001? image

mikemc commented 5 years ago

I can't say without knowing what analysis will come next, but no, I don't think that would make sense as that would be throwing out almost all of the ASVs. I would instead consider filtering based on prevalence and/or minimum abundance, rather than the mean. Or use tax_glom or tip_glom to merge ASVs rather than throw them away. The only safe general answer is to try your analysis and plots with different filterings to make sure your conclusions aren't being overly influenced by your filter decision.

LenaLapidot commented 5 years ago

Ok, so how would you suggest to filter based on prevalence or minimum abundance?

LenaLapidot commented 5 years ago

How would you determine the filtering cutoff for low abundant ASVs? In the Phyloseq tutorial it's 0.0005, but I don't understand how it was decided.

mikemc commented 5 years ago

As I've said and is said in the F1000 article, there is no way to answer what thresholds to use without knowing your experiment, data, and analysis goals, and so isn't something I can answer. The F1000 article gives one possible way to try to pick a prevalence filter.

LenaLapidot commented 5 years ago

I'll try the approach in the article, thank you.

LenaLapidot commented 5 years ago

How would you suggest to filter and plot according to minimum abundance?