Difference Between ENCODE and This?

YogiOnBioinformatics commented 4 years ago

Hello @reskejak!

One thing I'm confused about: what exactly is the difference between the analysis you are doing here and what ENCODE-DCC has published (via the Anshul Kundaje lab)?

You mention there is a difference but did not give the specifics. I ask since I'm curious and would love to take something from this that might be useful.

Thanks, Yogi

reskejak commented 4 years ago

Overall, this method and that of Kundaje/ENCODE are highly similar. As you may have suspected from the mtDNA removal tool suggestion, it is also based on certain aspects of the Harvard Informatics ATAC-seq guidelines (https://informatics.fas.harvard.edu/atac-seq-guidelines-old-version.html). Neither of these workflows focused on downstream differential analysis.

Key differences between this workflow and that of Kundaje/ENCODE:

There are no steps in this workflow to remove multi-mapped reads (non-unique alignments). This was just a personal preference decision made by our lab for our own analysis.
I believe the ENCODE method does not remove mitochondrial reads, as is done here (it has since been updated, so I am not entirely sure). This is less of an issue with omni-ATAC and newer methods, but the original ATAC-seq protocol has high levels of mitochondrial read contamination (see the original Buenrostro et al. paper). Some cell lines we have profiled resulted in >70% mtDNA contamination!
Here, library molecular complexity estimates are calculated using preseqR functions as wrapped by ATACseqQC. We then suggest subsampling replicates to equivalent complexity using samtools before proceeding to peak calling and differential analysis, for reasons described in our paper. In the ENCODE method, complexity serves as a QC metric rather than as a functional normalization value.
We do not use IDR for replicate concordance. We have streamlined the Kundaje/ENCODE "naive overlap" as a bash script and suggest its use as a middle ground between IDR (more conservative) and simple replicate intersects (more liberal).
The Kundaje/ENCODE method has cross-correlation and other QC analyses built into the reads-to-peaks pipeline. We use downstream csaw functions to compute fragment length distributions, etc., for these purposes.
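To make the complexity-based subsampling point above more concrete, here is a rough Python sketch. It assumes every replicate is scaled down to the lowest complexity estimate and that the fraction of reads to keep is simply the ratio of the two estimates. That is a simplification for illustration, not necessarily the exact rule used in the workflow. The function names, file names, and example numbers are hypothetical. The only samtools behavior relied on is that `samtools view -s` takes a SEED.FRACTION value.

```python
def subsample_fractions(complexity):
    """Map each replicate name to the fraction of reads to keep.

    Assumption: the target is the lowest complexity estimate among the
    replicates, and the keep-fraction is target / estimate.
    """
    if not complexity:
        raise ValueError("no replicates given")
    if any(value <= 0 for value in complexity.values()):
        raise ValueError("complexity estimates must be positive")
    target = min(complexity.values())
    return {name: target / value for name, value in complexity.items()}


def samtools_subsample_command(bam_in, bam_out, fraction, seed=1):
    """Build an argument list for `samtools view -s`.

    samtools reads the -s value as SEED.FRACTION: the integer part is the
    random seed and the decimal part is the fraction of reads to keep.
    """
    rounded = round(fraction, 4)
    if not 0 < rounded < 1:
        raise ValueError("fraction must round to a value strictly between 0 and 1")
    digits = f"{rounded:.4f}".split(".")[1]
    return ["samtools", "view", "-b", "-s", f"{seed}.{digits}", "-o", bam_out, bam_in]


if __name__ == "__main__":
    # Hypothetical complexity estimates (unique fragments) for two replicates.
    estimates = {"rep1": 10_000_000, "rep2": 15_000_000}
    for name, fraction in subsample_fractions(estimates).items():
        if fraction >= 1:
            continue  # lowest-complexity replicate is left untouched
        command = samtools_subsample_command(f"{name}.bam", f"{name}.sub.bam", fraction)
        print(" ".join(command))
```

The command is only printed here, so nothing is run against real BAM files. Passing the list to `subprocess.run` would execute it.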
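For the "naive overlap" idea mentioned above, a minimal pure-Python sketch may help show where it sits between IDR and a plain intersect. It assumes the commonly described rule: a peak called on the pooled data is kept only if, in each replicate, some peak overlaps it by at least 50% of either peak's length. The 0.5 threshold, the function names, and the tuple layout are illustrative assumptions and are not taken from the bash script itself.

```python
def overlap_bp(a, b):
    """Number of shared bases between two (chrom, start, end) intervals."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def supported(peak, replicate_peaks, min_frac=0.5):
    """True if any replicate peak overlaps `peak` by at least `min_frac`
    of the length of either of the two peaks."""
    for other in replicate_peaks:
        shared = overlap_bp(peak, other)
        if shared == 0:
            continue
        peak_len = peak[2] - peak[1]
        other_len = other[2] - other[1]
        if shared / peak_len >= min_frac or shared / other_len >= min_frac:
            return True
    return False


def naive_overlap(pooled, rep1, rep2, min_frac=0.5):
    """Keep pooled peaks that are supported by both replicates.

    This is a quadratic scan, which is fine for a sketch; real peak sets
    would normally go through an interval tool such as bedtools.
    """
    return [
        peak for peak in pooled
        if supported(peak, rep1, min_frac) and supported(peak, rep2, min_frac)
    ]


if __name__ == "__main__":
    # Toy intervals, made up for illustration.
    pooled = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
    rep1 = [("chr1", 120, 220), ("chr1", 590, 700), ("chr2", 100, 200)]
    rep2 = [("chr1", 90, 180), ("chr1", 500, 600)]
    print(naive_overlap(pooled, rep1, rep2))
```

In the toy data only the first pooled peak survives: the second has just a 10 bp overlap in rep1, and the third has no support in rep2. A simple intersect would have kept the second peak as well, which is the sense in which this filter is stricter.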

I hope this helps. Please feel free to ask any additional questions.

YogiOnBioinformatics commented 4 years ago

Hello,

Thanks so much for such a wonderfully detailed explanation!

All the best, Yogi
