jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License
182 stars 27 forks source link

Genrich with reads from a tiny subset of genome #29

Closed AlexBlais74 closed 4 years ago

AlexBlais74 commented 4 years ago

Hello I am using Genrich to call peaks as part of my ATAC-seq analysis pipeline. To speed up the optimization trials, I have been working with reads coming from only an 8 Mbp region on mouse chr19. However the SAM header in my bam file contains all mouse chromosomes with their entire lengths. I guess this may have thrown Genrich off, because the peaks that were called make little sense. This image shows the bedgraph output from Genrich, and the narrowpeaks called: image

And a zoom in: image

So, something isn't right and I guess it is because Genrich mistakenly thinks these 1.3 million read pairs are evenly distributed over the entire genome.

If my guess is right, then it would be nice if users had the option of manually entering a custom value for genome size, to over-ride the SAM header chromosome length counting. Edit: Please let me know if you can think if a workaround that would work with the current version.

Thanks a lot for your help, and for developing a great program.

Alex

jsh58 commented 4 years ago

Alex,

Thanks for the question, and for including the images.

I guess it is because Genrich mistakenly thinks these 1.3 million read pairs are evenly distributed over the entire genome.

Yes, that's the null model.

Please let me know if you can think if a workaround that would work with the current version.

You can use both -e and -E to eliminate the parts of the genome that aren't part of the assay.

Or, perhaps you could clarify what you mean by "optimization trials." If you wanted to vary just the peak-calling parameters, you could use a combination of -X and -P without subsetting the BAM file.

John Gaspar

AlexBlais74 commented 4 years ago

Thanks for the reply. Silly me, I had the solution right under my nose. I will try that.

By optimization, I meant to explore the effect of changing certain variables (e.g. -d and -D in -j mode) on peak calls and bedgraph outputs. I thought it was superfluous to call peaks on the whole genome and I have limited access to the computer cluster I am using.

Best regards

Alex