Run ATACseq downstream analyses

KittyMurphy commented 10 months ago

[x] Peak calling
[x] Read counts
[x] Differential chromatin accessibility analysis: try csaw and DeSeq2
[x] Functional enrichment analysis
[x] s-LDSC
[x] Motif enrichment analysis
[x] Identify which peaks contain VDR
[ ] TF footprinting (can't do this as it's recommended to use samples with ~200 million reads)
[x] correlate logFC between genes in RNAseq analysis and the logFC of the promoter regions using ATACseq data --> use ATAC-seq peaks that are within 5kb of the TSS (maybe use top 100/200 for this)
[ ] run pathway enrichment analysis using genes identified in RNAseq and ATACseq (not enough genes)

KittyMurphy commented 10 months ago

I've tried out various peak callers and parameters.

macs2 callpeak -f BAMPE -g hs -B --keep-dup all -t <sample>.bam
macs2 callpeak -f BEDPE -nomodel --shift -37 --extsize 73 -g hs -B --keep-dup all -t <sample>.bed

Options 1 and 2 output the same peaks (n = ~250,000), so I decided to stick with calling peaks with BAMPE as this is the direct output I get from genome alignment.

macs3 callpeak -f BAMPE -g hs -B --keep-dup all --cutoff-analysis -t <sample>.bam

--cutoff-analysis helps you decide a p or q value threshold, according to my output I should be using a p cut off of 0.01. Perhaps I was being too stringent beforehand as I filtered my peaks to those with p < 0.0000001.

macs3 callpeak -f BAMPE -g hs -B --keep-dup all -p 0.01 -t <sample>.bam

Overall, MACS3 output a higher number of peaks (n = ~300,000), and is suggested to be more suited to ATAC-seq data than MACS2.

macs3 hmmratac -b <sample>.bam

MACS3 now implements hhmratac, which is a peak caller specifically designed for ATAC-seq data. This method output ~130,000 peaks. Of note, all the MACS algorithms output peaks with similar length distribution (50-300bp), whereas hhmratac had peak lengths even as large as 23kb!

KittyMurphy commented 9 months ago

I ran differential chromatin accessibility using DeSeq2, and had much better overlap with the differentially expressed genes (also identified using DeSeq2).

This study compared different tools for differential chromatin accessibility analysis and with our number of samples, DeSeq2 had good sensitivity and specificity.

KittyMurphy commented 9 months ago

Re: motif enrichment analysis

For example, using the top 100 upregulated peaks in E2, the mean distance to TSS of these peaks is ~40,000. In comparison, the mean distance to TSS of the background peaks is ~1,800. Could explain why the enrichments aren't super strong (as is usually seen with homer). Also, when I've seen examples of homer output the number of target and background sequences is not so different. I have 100 target sequences and 170,000 background sequences, might be worth subsetting.

neurogenomics / APOE_microglia

Run ATACseq downstream analyses #2