jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License

Best peak-calling strategy for differential chromatin accessibility analysis: individual versus concatenated BAMs #25

Closed pgugger closed 4 years ago

pgugger commented 5 years ago

I am curious about your thoughts on the best ATAC-Seq analysis strategy for differential accessibility analyses among experimental groups. We have data from a few related experiments totaling 8 conditions with 4 biological replicates per condition (not all pairs of conditions are of interest, though). Because of the number of replicates and conditions, there are a number of ways I could see running the peak calling in Genrich to generate a consensus set of peaks for subsequent read/fragment counting per peak (e.g., featureCounts) and differential accessibility analysis (e.g., DESeq2). Some possibilities include:

1) Concatenate all read data into a single BAM and provide that single file as input
2) Provide a comma-separated list of all individual BAM files, regardless of experimental condition
3) Provide comma-separated lists for each condition, and run each condition separately. Then, use bedtools merge to form the consensus among the various resulting narrowPeak files
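To make the consensus step in option 3 concrete, here is a minimal sketch of what merging per-condition peak sets amounts to. It is not Genrich or bedtools code. The function name `merge_peaks` and the toy coordinates are hypothetical. The sketch assumes BED-style half-open intervals and treats overlapping or book-ended peaks as one region, which is my understanding of the default behavior of bedtools merge.

```python
from typing import Iterable, List, Tuple

# (chromosome, start, end) in BED-style half-open coordinates
Interval = Tuple[str, int, int]


def merge_peaks(peak_sets: Iterable[Iterable[Interval]]) -> List[Interval]:
    """Pool peaks from several conditions and merge overlapping ones.

    Hypothetical helper for illustration only: it mimics the idea of
    sorting the pooled narrowPeak intervals and merging them into a
    single consensus set.
    """
    # Pool all peaks and sort by chromosome, then start, then end.
    pooled = sorted(iv for peaks in peak_sets for iv in peaks)

    merged: List[Interval] = []
    for chrom, start, end in pooled:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlapping or book-ended with the previous region: extend it.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            # Starts a new consensus region.
            merged.append((chrom, start, end))
    return merged


# Toy peak calls for two conditions (made-up coordinates).
cond_a = [("chr1", 100, 200), ("chr1", 500, 600)]
cond_b = [("chr1", 150, 250), ("chr2", 10, 50)]

consensus = merge_peaks([cond_a, cond_b])
for chrom, start, end in consensus:
    print(chrom, start, end, sep="\t")
```

Each consensus region is the union of every peak that touches it, so a peak called in only one condition still ends up in the set used for counting.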

My inclination is towards one of the first two options that use all the data at once, primarily because of the arguments presented in Lun & Smyth 2014 (https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gku351). However, I would be interested in what others think, both about which is the best strategy in general and which strategy might be most appropriate in the specific case of Genrich (especially, given that Genrich can synthesize across replicates).

Thanks for your help!

jsh58 commented 5 years ago

Thanks for the very interesting question. I would also like to hear what others think, but will contribute my $0.02 first.

I have not had the time to go through the Lun & Smyth MS in great detail, but I did note a heavy reliance on MACS. This made me cringe, because alignment parsing with MACS is flawed. It is difficult to build an elaborate statistical structure on a foundation of sand.

With regard to your three scenarios:

A theoretical disadvantage of such a method would be not being able to identify rare peaks appearing only in very rare populations.

This is certainly true (not sure why they used the word "theoretical" though). Although the two programs (cellranger-atac and Genrich) use different statistical models, this disadvantage would be the same with either.

pgugger commented 5 years ago

Thanks, this makes sense. For now, I will try both approaches 1 and 3 and see how the results compare.

malcook commented 3 years ago

@pgugger - though this issue is closed, I wonder if you might share any observations you made after comparing approaches 1 and 3, as you proposed to do.