nf-core / atacseq

ATAC-seq peak-calling and QC analysis pipeline
https://nf-co.re/atacseq
MIT License
188 stars 117 forks source link

Adding the feature of shifting and splitting the reads #91

Open husaynahmed opened 4 years ago

husaynahmed commented 4 years ago

Hi,

I was wondering if adding the feature of shifting and splitting the reads after the alignment would be helpful.

A recent review on the approaches to analyze ATAC-seq data (Yan et al., 2020) suggests reads should be shifted + 4 bp and − 5 bp for positive and negative strand respectively, to account for the 9-bp duplication created by DNA repair of the nick by Tn5 transposase and achieve base-pair resolution of TF footprint and motif related analyses.

Also, in many studies researchers have looked into the nucleosomal-free regions and nucleosome associated regions separately. For example Yoshida et al., 2019 .

Here's what I am proposing.

Let's have three modes of post-alignment analysis for the ATAC-seq pipeline.

  1. No shifting and splitting the aligned reads
  2. Shifting the reads and splitting them into two - NFR (Nucleosomal Free Regions) and NBR (Nucleosomal Bound regions)
  3. Shifting the reads and splitting them into four - NFR (Nucleosomal Free Regions), Mono-, di- and tri-nucleosomal bound regions.

The first approach (which is the current atacseq pipeline) is useful in almost all cases where identifying the open chromatin regions is the objective. The latter would be helpful if the exact cut sites of transposase is important like motif analyses, footprinting etc., or when the analysis requires looking into the NFR/NBR regions separately.

The downstream analysis and QC could be done depending on which of these options the user chooses. For example, if the user chooses option 2, we could generate mergedLibrary and mergedReplicate BAM, bigWigs etc separately for NFR and NBR and perform peak calling and all the QC for them separately.

I have used deeptools in the past to perform shifting and splitting. Here's an example code for the same.


bam_loc="/path/to/aligned/bamfiles/directory"
out_loc="/path/to/output/directory"

## For NFR (Frag length of 120 bp and less)

alignmentSieve --bam $bam_loc/${SAMPLE}.mLb.clN.sorted.bam \
               --outFile $out_loc/${SAMPLE}.mLb.ss.NFR.bam \
               --ATACshift \
               -p 8 \
               --smartLabels \
               --minFragmentLength 0 \
               --maxFragmentLength 120 \
               --filterMetrics $out_loc/${SAMPLE}.mLb.ss.NFR.log

## For NBR (Frag length of 180 bp and above)

alignmentSieve --bam $bam_loc/${SAMPLE}.mLb.clN.sorted.bam \
               --outFile $out_loc/${SAMPLE}.mLb.ss.NBR.bam \
               --ATACshift \
               -p 8 \
               --smartLabels \
               --minFragmentLength 180 \
               --filterMetrics $out_loc/${SAMPLE}.mLb.ss.NBR.log

It would be great to know comments and suggestions from the nf-core members. I would be happy to discuss. I am a beginner in the nf-core community but would be happy to contribute to this pipeline enhancement.

houghtos commented 4 years ago

Suggest also reading Thomas Caroll ATACseq workflow: https://rockefelleruniversity.github.io/RU_ATAC_Workshop.html Goes into using Subread to partitioning alignment BAM file into nucleosome free region, di-nucleosome, and mono-nucleosome alignment files. Based on the ATACseq Greenleaf paper..

husaynahmed commented 4 years ago

Suggest also reading Thomas Caroll ATACseq workflow: https://rockefelleruniversity.github.io/RU_ATAC_Workshop.html Goes into using Subread to partitioning alignment BAM file into nucleosome free region, di-nucleosome, and mono-nucleosome alignment files. Based on the ATACseq Greenleaf paper..

Thanks, @houghtos. This pipeline also performs partitioning. Perhaps I will give it a try. However, it would be great if we could integrate this feature in the nf-core/atacseq pipeline.

drpatelh commented 4 years ago

Thanks for the detailed info and references @husaynahmed @houghtos. Having thought about this a little I think this is something that would be ideal to implement as a sub-workflow using DSL2. This means that we can just pass a BAM file (filtered in whatever way) to the same sub-workflow to run the downstream processes. I think I will probably start making a serious effort to port this pipeline to DSL2 after the next release so your input and contributions will be more than welcome 😃

nservant commented 3 years ago

Hi @husaynahmed

Thank you for these references. I have an additional question related to reads shifting and peak calling parameters ... which is finally also related to #108.

If I'm correct, when you are using Macs2 --shift 100 --extend 200 or genrich -j -y (see https://informatics.fas.harvard.edu/atac-seq-guidelines.html), you do not need to do reads shifting, as the shift is part of the peak calling analysis ? However, applying the shift with alignmentSieve can indeed be useful if you are using for instance Macs2 in BAMPE mode using the full fragment as a proxy.

Is that correct ? Thanks Nicolas