jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License
182 stars 27 forks source link

Peak calling missing closely located peaks? #80

Closed Simso86 closed 4 months ago

Simso86 commented 2 years ago

Hi,

First of all, thank you for a great and useful tool.

I have a number of ATAC-seq samples (wild type an mutants). These have been mapped to human reference genome hg38 and I proceeded with peak calling with Genrich. First I used the parameters -q 0.05 and -a 200, removing chrM and blacklisted regions. I noticed that sometimes, peaks located close together where not called even though together they should met both the q- and a- threshold. The image below shows tracks in IGV. For each sample the pileup bedgraph converted to TDF is shown and under it the called peak from the corresponding narrowPeaks file. While a peak have been called for sample 1, this is not the case for the other samples even though some of them, from a visual and comparative point of view, seem to meet the -q and -a thresholds. All the tracks are set to the same scale. IGV_ATACpeaks

I figured that a possible solution could be to increase the -g parameter from the default 100 bp to 1000 bp in order to make Genrich merge peaks that belongs to the same enriched region. After all, according to my knowledge, ATAC-seq generally produce broader regions of enrichment that for example Chip-seq.

I used the following code for each sorted and indexed BAM file (the only difference from before is the addition of the -g parameter):

Genrich -t Sample.bam -o Results.narrowPeak -f Results_pq.bdg -k Results_pileup.bdg -b Results_bed -r -j -q 0.05 -a 200 -g 1000 -v -e chrM -E hg38-blacklist.v2.bed.gz

While this did capture broader regions of enrichment and also some of the peaks seen in the picture above, browsing the tracks I still noticed places where rather obvious peaks was not called (see example image below, tracks set to the same scale).

IGV_ATACpeaks_g1000

Here I would expect the peaks in sample 3 (and maybe also sample 6) to be called, because they seem to be similar in AUC to that of sample 1, which was called.

My question is: Should I try an even higher value for the -g parameter? Is this a correct reasoning, or are there other ways to make sure peak calling does not miss regions of enrichment like the ones above?

Best regards, Simon Söderholm

jsh58 commented 2 years ago
[...] peaks located close together where not called even though together they should met both the q- and a- threshold [...]
[...] some of them, from a visual and comparative point of view, seem to meet the -q and -a thresholds[...]
[...] I still noticed places where rather obvious peaks was not called[...]

Genrich does not call peaks based on what "should" meet a threshold, or what is "rather obvious"ly a peak. As described in the README, it calls peaks based on a concrete evaluation of alignments using a statistical model. You can understand why Genrich did nor did not call a peak by examining the -f <file>.

If there is an error in alignment parsing, please report it. If the peak-calling parameters do not result in peaks that are expected, please alter them to suit your needs. The -g parameter controls the linking together of sites that achieve significance; it will not help call a peak in a region where nothing reaches the significance threshold.

Simso86 commented 2 years ago

Thank you for your very quick reply, and for clarifying the -g parameter. I am fully aware that Genrich uses a statistical model to calculate significant enrichment based on alignments. I am sorry that my questions were a bit unclear. I was simply seeking advice on how to interpret my results. I realize that I had misunderstood the -g parameter. Also, I guess that even though the peaks are closely located the will be identified as individual peaks and not a single region of enrichment and therefore none of them reaches the significance threshold. I will make some more peak calling tests and tweak the parameters and see if these regions could be retained, perhaps by lowering the -a parameter a bit. Otherwise, I will just have to accept that these regions are not statistically significant.

Thanks again for the clarifications. Best regards, Simon Söderholm

malcook commented 2 years ago

If I may chime in on this conversation….

I am having fine success with -a 0 (with 3 replicates) which to my understanding results in peaks that are significant according to the model and the desired p (or q) value threshold chosen (I having been using -q .05)

Using -a of anything other than zero is not going to yield "more significant" peaks. It will filter the peaks based on the "auc", which is really the area between two curves, being the -log10 of the p-value corresponding to rejecting the NULL hypothesis and that of the empirical test statistic. The name auc sounds technical but is really a heuristic about which developing intuition is challenging given that the same auc can result from quite different relationship between the two curves. So, using anything other than -auc 0 is really deviating from using the statistical model Genrich implements, and IMO clouds interpretation.

@jsh58 - I'd appreciate your feedback on my remarks.

@Simso86 - it is a little hard to appreciate your observations. Are we to understand that there are empty tracks showing no peaks for all samples other than Sample1? You write "tracks set to the same scale"; it may help to know: what is that scale and what are the units of the tracks & how were they produced?

jsh58 commented 2 years ago

@malcook I agree with your analysis in general. The AUC is indeed a heuristic that is intended to produce reasonable results for most experiments, and also allow researchers to evaluate and adjust peak-calling parameters in a way that macs2 couldn't. It may well be the case that your experiment, with three replicates, is sufficiently powered that you can produce near-optimal results with -a 0. Great. Many experiments are underpowered and may not do so well without increasing the AUC threshold.