Best peak-calling strategy for differential chromatin accessibility analysis: individual versus concatenated BAMs

pgugger commented 5 years ago

I am curious about your thoughts on the best ATAC-Seq analysis strategy for differential accessibility analyses among experimental groups. We have data from a few related experiments totaling 8 conditions with 4 biological replicates per condition (not all pairs of conditions are of interest, though). Because of the number of replicates and conditions, there are a number of ways I could see running the peak calling in Genrich to generate a consensus set of peaks for subsequent read/fragment counting per peak (e.g., featureCounts) and differential accessibility analysis (e.g., DESeq2). Some possibilities include:

1) Concatenate all read data into a single BAM and provide that single file as input 2) Provide a comma-separated list of all individual BAM files, regardless of experimental condition 3) Provide comma-separated lists for each condition, and run each condition separately. Then, use bedtools merge to form the consensus among the various resulting narrowPeak files

My inclination is towards one of the first two options that use all the data at once, primarily because of the arguments presented in Lun & Smyth 2014 (https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gku351). However, I would be interested in what others think, both about which is the best strategy in general and which strategy might be most appropriate in the specific case of Genrich (especially, given that Genrich can synthesize across replicates).

Thanks for your help!

jsh58 commented 5 years ago

Thanks for the very interesting question. I would also like to hear what others think, but will contribute my $0.02 first.

I have not had the time to go through the Lun & Smith MS in great detail, but I did note a heavy reliance on MACS. This made me cringe, because alignment parsing with MACS is flawed. It is difficult to build an elaborate statistical structure on a foundation of sand.

With regard to your three scenarios:

Approach 1 is essentially what the 10X Genomics software for single-cell ATAC-seq, cellranger-atac, uses (combining all of the reads together, calling peaks collectively, and then separating out the cell populations). However, they note that:

A theoretical disadvantage of such a method would be not being able to identify rare peaks appearing only in very rare populations.

This is certainly true (not sure why they used the word "theoretical" though). Though the two programs ( cellranger-atac and Genrich) use different statistical models, this disadvantage would be the same with either.

Approach 2 does not make sense to me, given the way that Genrich analyzes multiple BAM files as replicates.
Approach 3 is what I would recommend. (In fact, this came up in the answer to #19.) This approach would make use of the extra power granted by multiple replicates (and Genrich's ability to handle them), and it would not suffer the disadvantage of approach 1.

pgugger commented 5 years ago

Thanks, this makes sense. For now, I will try both approaches 1 and 3 and see how the results compare.

malcook commented 3 years ago

@pgugger - though this issue is closed, I wonder if you might share any observations you made after comparing 1 and, as you proposed to do.

jsh58 / Genrich

Best peak-calling strategy for differential chromatin accessibility analysis: individual versus concatenated BAMs #25