Peak calling parameters might not be ideal

ktrns commented 2 years ago

Dear all,

We had started to discuss this topic in Slack, but I decided to raise this issue so the comments won't get lost.

In brief, you currently use the macs2 settings originally suggested for ATAC-seq data: -f "BAMPE" --nomodel. I have spent some time on reading, and I find this discussion very helpful: https://twitter.com/XiChenUoM/status/1336658454866325506

In brief, by using -f "BAMPE" we focus on only 1 of the 2 cut sites of the fragment, which does not seem to be the most appropriate way to handle ATAC-seq data.

Based on what I read, converting bam to bed, and then use -f bed (or bedpe, still have to figure that out) together with --shift -75 --extsize 150 is better. You would basically use both reads R1 and R2, and create 150bp reads with the mid-point on the cut-site of the transposase. Another twitter reply to the above thread said that Encode3 used --shift -37 --extsize 73.

I will likely start to try this to see if it gives an improvement.

Thanks a lot for your valuable work on the pipeline! Katrin

ktrns commented 1 year ago

Dear all,

I have had a closer look into this issue. I believe that the way macs2 is used for this pipeline is wrong. To understand my point it is very helpful to look at the link I pasted above.

For ChIP-seq, we are interested in the region between paired reads. In contrast, for ATAC-seq, we are interested in the individual reads as they represent the cut sites of the Tn5 transposase and hence open chromatin.

As can be seen in the insert size distribution of ATAC-seq data (wavy pattern), many read pairs span one or more nucleosomes. In its current state, the pipeline produces bigWig files and calls macs2 based on fragments, i.e. regions spanned by read pairs. This means that often what we detect are nucleosomes rather than open chromatin.

One reason the issue was never really obvious in IGV is that also the bigWig files are generated based on fragments in the nf-core code. This way the "wrong" peaks match the "wrong” coverage tracks.

I gave it a try and called macs2 as suggested above in my first post:

# bam to bed
# We want to treat each cut site as a single event
bedtools bamtobed -i sampleA.sorted.bam >> sampleA.sorted.bed

# macs2
macs2 \
  callpeak \
  --keep-dup all --nomodel \
  -f BED \
  --shift -75 --extsize 150 \
  --name sampleA \
  --treatment sampleA.bed

In the attached image, you see the following tracks from top to bottom:

consensus peaks produced by the pipeline (black; there were more samples than the one shown)
bigWig produced by the pipeline (dark blue)
narrow peaks detected by the pipeline (lighter blue)
coverage track generated by me based on a single-end bed file (which itself is based on the bam file by the pipeline) (salmon coloured)
macs peaks generated by me (pink) with --shift -75 --extsize 150
macs peaks generated by me (pink) with --shift -100 --extsize 200
loaded bam file produced by the pipeline

I should say that also this representation is not ideal, as in theory the visualised reads are running into the nucleosome, whereas macs2 uses one end of the read and generates artificial reads around the cut site itself. Therefore reads and peaks still don’t match exactly.

The example still shows the effect of using fragments rather than the cut sites. The fragments (blue tracks) pretty much include/span (very likely) one nuclesome.

I hope all of this makes some sense to you :-). Please share your opinion on this, am I misinterpreting the results? I am curious to see what you think.

I will try a few more examples, and see if I can fix this systematically for the pipeline.

Best & many thanks Katrin

olechnwin commented 1 year ago

@ktrns and @JoseEspinosa,

This may be a silly question but in the region that is shown above, we are assuming the depleted region between the two peaks is where the nucleosome is, right? Can't it be where TF bind instead? How do you tell the difference?

Also, if it is TF binding, having a peak that only capture the open chromatin region on either side of the TF will not include their motifs. When we do motif discovery on the peaks, then the expected TF motifs won't be found?

Also in this suggested macs callpeak parameter, we no longer call broad peaks?

TIA for your insights.

SergioRodLla commented 8 months ago

Hello,

I saw that some papers in which they use macs2 for ATAC-seq they previously filter the fragment sizes to get a set corresponding to nucleosome free regions (NFR) and then they apply macs2 callpeak to them. Do you think this would solve the issue that @ktrns mentions about "visualised reads are running into the nucleosome"? Let me know what you think about this.

Best regards, Sergio

nf-core / atacseq

Peak calling parameters might not be ideal #169