neurogenomics / APOE_microglia

GNU General Public License v3.0
0 stars 0 forks source link

Run ATACseq downstream analyses #2

Closed KittyMurphy closed 3 months ago

KittyMurphy commented 10 months ago
KittyMurphy commented 10 months ago

I've tried out various peak callers and parameters.

  1. macs2 callpeak -f BAMPE -g hs -B --keep-dup all -t <sample>.bam
  2. macs2 callpeak -f BEDPE -nomodel --shift -37 --extsize 73 -g hs -B --keep-dup all -t <sample>.bed

Options 1 and 2 output the same peaks (n = ~250,000), so I decided to stick with calling peaks with BAMPE as this is the direct output I get from genome alignment.

  1. macs3 callpeak -f BAMPE -g hs -B --keep-dup all --cutoff-analysis -t <sample>.bam

--cutoff-analysis helps you decide a p or q value threshold, according to my output I should be using a p cut off of 0.01. Perhaps I was being too stringent beforehand as I filtered my peaks to those with p < 0.0000001.

  1. macs3 callpeak -f BAMPE -g hs -B --keep-dup all -p 0.01 -t <sample>.bam

Overall, MACS3 output a higher number of peaks (n = ~300,000), and is suggested to be more suited to ATAC-seq data than MACS2.

  1. macs3 hmmratac -b <sample>.bam

MACS3 now implements hhmratac, which is a peak caller specifically designed for ATAC-seq data. This method output ~130,000 peaks. Of note, all the MACS algorithms output peaks with similar length distribution (50-300bp), whereas hhmratac had peak lengths even as large as 23kb!

KittyMurphy commented 9 months ago

I ran differential chromatin accessibility using DeSeq2, and had much better overlap with the differentially expressed genes (also identified using DeSeq2).

This study compared different tools for differential chromatin accessibility analysis and with our number of samples, DeSeq2 had good sensitivity and specificity.

KittyMurphy commented 9 months ago

Re: motif enrichment analysis

For example, using the top 100 upregulated peaks in E2, the mean distance to TSS of these peaks is ~40,000. In comparison, the mean distance to TSS of the background peaks is ~1,800. Could explain why the enrichments aren't super strong (as is usually seen with homer). Also, when I've seen examples of homer output the number of target and background sequences is not so different. I have 100 target sequences and 170,000 background sequences, might be worth subsetting.