Request optional fragment length filter

jmw86069 commented 2 years ago

I started using Genrich the past couple months, and wow I'm already loving it! The peak calls are quite good, and the ability to exclude regions, chromosomes upfront turns out to be super convenient. Also nice to use the pileup-like file to make coverage tracks for review. Anyway, thanks so much for putting it out there!

Along those lines, would it be reasonable to filter paired BAM alignments by fragment length? Since you're already using properly paired reads, I think TLEN would suffice. I looked for an option to filter the BAM file, and apologize if I missed something obvious already in place.

Frankly, this feature would sidestep some other processes, making it super convenient to call peaks with different fragment lengths for comparison, and generating pileup-like/coverage files at the same time.

Thanks in advance for your thoughts!

jsh58 commented 2 years ago

Thanks for the kind words and the question. I am unsure of the utility of filtering by fragment length, even in the case of ATAC-seq (see #84 for example), except for the experimental exploration that you are suggesting. If you want to go down that path, here are a couple options:

set a maximum length for alignments to be considered properly paired by the aligner (-X <int> in bowtie2)
filter the BAM yourself using samtools and awk (and yes, TLEN should suffice, though beware that sometimes it is negative); if you're concerned about disk space from generating many BAM files, you can instead pipe them to Genrich via -t -

jmw86069 commented 2 years ago

That's a fair response. ATAC-seq analysis has a lot of differing opinions, methods, pros/cons, all that. There are methods that attempt what you described, subtracting NFR fragment coverage from total fragment coverage for example, resulting in regions with fragment coverage but no Tn5 binding sites. They can work well with right coverage, etc., but for us that isn't the main purpose. When we prep NFR reads, we usually generate fragment coverage, and in our experience the coverage tends to have sharper "edges" than comparable methods. On a TSS-centered coverage heatmap it looks slightly more focused.

The more I think about it, in practice it probably has relatively little effect on peak calling, as compared to taking read ends, expanding to 100-base width in Genrich, etc. Actually quite like the peak calling results fwiw.

NFR is slightly different because at some narrow width, it becomes slightly better to use fragment coverage than read-end coverage, if that makes sense. We "know" for an NFR fragment there was no nucleosome between the two ends when the fragment was created, so it's a wide region inside which we can fairly confidently say is nucleosome-free. Outside the read ends, we can't say that - so at least for these fragments, extending ~50-base outside each read may be slightly detrimental.

But like I said, the number of peaks where this might be helpful? Probably really few - and I'm guessing Genrich already calls peaks using the "ATAC" approach. So I guess it's on me to test and report back how things turned out. :) I'll filter for NFR fragments, run in normal mode, then compare to overall "ATAC" calls using all reads. My prediction is peaks will be nearly identical (except reduced peaks from fewer NFR reads, but from conceptually equivalent places in the genome). I'm curious the effects on coverage, mostly in coverage heatmaps.

jsh58 / Genrich

Request optional fragment length filter #93