spholmes / F1000_workflow

43 stars 33 forks source link

Effects of prevalence filtering on downstream analysis #28

Open zefrieira opened 6 years ago

zefrieira commented 6 years ago

Hi,

I'm following the "Workflow for Microbiome Data Analysis" to analyse some 16S and ITS datasets. I'm unsure about the effects that prevalence filtering could have in downstream analysis I'm planning to do (a differential abundance analysis with DESeq2, for example).

This is how I'm performing prevalence filtering right now:

ps_16s_relative <- transform_sample_counts(ps_16s, function(x) x/sum(x))

prevalence_16s <- apply(X = otu_table(ps_16s_relative),
                        MARGIN = 1,
                        FUN = function(x) sum(x >= 0.0001))

prevalence_df_16s <- data.frame(prevalence = prevalence_16s,
                                relative_prevalence = prevalence_16s/nsamples(ps_16s_relative),
                                total_abundance = taxa_sums(ps_16s_relative),
                                tax_table(ps_16s_relative))

ps_16s <- prune_taxa(rownames(prevalence_df_16s)[(prevalence_df_16s$relative_prevalence >= 0.05)], ps_16s)

And these are the numbers I'm getting (with the 16S data):

'Number of ASVs before the prevalence filtering: 28582'
'Number of phyla before the prevalence filtering: 40'

'Number of ASVs after the prevalence filtering: 5762'
'Number of phyla after the prevalence filtering: 23'

I'm felling that even though the filtered data may be a better representation of the microbial community, I'm losing too much data.

Is the prevalence filtered data recommended for downstream analysis?

spholmes commented 6 years ago

It depends on what you mean by downstream and you should also pay attention to the threshold you choose, it should reflect how many ASV's you truly believe to be there. Filtering is a very important step if you want to do multiple testing later on, as if you keep too many ASV's then none will end up significant.

See Chapter 6 here: http://web.stanford.edu/class/bios221/book/Chap-Testing.html#Chap:Testing

On Fri, Oct 5, 2018 at 10:26 AM zefrieira notifications@github.com wrote:

Hi,

I'm following the "Workflow for Microbiome Data Analysis" http://web.stanford.edu/class/bios221/MicrobiomeWorkflowII.html to analyse some 16S and ITS datasets. I'm unsure about the effects that prevalence filtering could have in downstream analysis I'm planning to do (a differential abundance analysis with DESeq2, for example).

This is how I'm performing prevalence filtering right now:

ps_16s_relative <- transform_sample_counts(ps_16s, function(x) x/sum(x))

prevalence_16s <- apply(X = otu_table(ps_16s_relative), MARGIN = 1, FUN = function(x) sum(x >= 0.0001))

prevalence_df_16s <- data.frame(prevalence = prevalence_16s, relative_prevalence = prevalence_16s/nsamples(ps_16s_relative), total_abundance = taxa_sums(ps_16s_relative), tax_table(ps_16s_relative))

ps_16s <- prune_taxa(rownames(prevalence_df_16s)[(prevalence_df_16s$relative_prevalence >= 0.05)], ps_16s)

And these are the numbers I'm getting (with the 16S data):

'Number of ASVs before the prevalence filtering: 28582' 'Number of phyla before the prevalence filtering: 40'

'Number of ASVs after the prevalence filtering: 5762' 'Number of phyla after the prevalence filtering: 23'

I'm felling that even though the filtered data may be a better representation of the microbial community, I'm losing too much data.

Is the prevalence filtered data recommended for downstream analysis?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/spholmes/F1000_workflow/issues/28, or mute the thread https://github.com/notifications/unsubscribe-auth/ABJcvR5uO5t0kCI_2oEKp0p8LNVxd1i8ks5uh5ZhgaJpZM4XKoCn .

-- Susan Holmes John Henry Samter Fellow in Undergraduate Education Professor, Statistics 2017-2018 CASBS Fellow, Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/

zefrieira commented 6 years ago

I'm transitioning from RNA-Seq analysis, where I would leave DESeq2 use it's built-in expression filters to remove lowly-expressed genes, so I'm having some trouble to establish the prevalence filter thresholds. Currently, I'm considering a given ASV present if it's abundance is at least 0.01% of the total sample reads (I'm using relative abundances because my samples have a highly variable number of reads). Then, I filter out ASVs that are present in less than 5% of the samples.

The downstream analysis I'm planning to do are PERMANOVA (to check whether samples differ between groups), DESeq2+ZINBWAVE differential abundance tests and discovering the "core" ASVs of each group.

Thank you for the resources!