jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License
182 stars 27 forks source link

Is this miss expected? #62

Closed dariober closed 3 years ago

dariober commented 3 years ago

Hi- I'm puzzled why an apparently convincing peak has been missed by Genrich v0.6. Here's a view of the peak (top: ChIP, bottom: input control):

PbANKA_12_v3_621857_766241

The peak summit has about 40x read coverage.

A subset of the input files including only two chromosomes is here https://www.dropbox.com/s/i5skmxlr0zrk2ks/files.zip?dl=0 and the command I used is:

Genrich -v -y -t chip.bam -c input.bam -o tmp.narrowPeak

Processing experimental file #0: chip.bam
  BAM records analyzed:      247872
    Paired alignments:            0
    Unpaired alignments:     247872
  Fragments analyzed:        247872
    Full fragments:               0
    Half fragments:          247872
      (from unpaired alns)
Processing control file #0: input.bam
  BAM records analyzed:      764018
    Paired alignments:            0
    Unpaired alignments:     764018
  Fragments analyzed:        764018
    Full fragments:               0
    Half fragments:          764018
      (from unpaired alns)
  Background pileup value: 0.998186
  Scaling factor for control pileup: 0.324661
Peak-calling parameters:
  Genome length: 18607862bp
  Significance threshold: -log(p) > 2.000
  Min. AUC: 200.000
  Max. gap between sites: 100bp
Peaks identified: 47 (115615bp)

Is this miss expected? Thanks!

jsh58 commented 3 years ago

Thanks for the question. Keep in mind that, with single-end sequencing and -y, the fragments are interpreted by Genrich as spanning only the length of the reads (see this section of the README).

Could you please post the section from the bedgraph-ish output file (-f <file>) corresponding to the putative peak when you run the above command?

dariober commented 3 years ago

Thanks for getting back to me - I attach here the bedgraph in the region PbANKA_12_v3:600000-800000. The peak is around 614000-700000. I attach the also narrowPeak file.

pval.target.bdg.gz tmp.narrowPeak.gz

jsh58 commented 3 years ago

Thanks for the files. The control pileup is pretty high in this region, in the 3-6 range (the overall background is 0.998, per the log file). By my calculation, the region 692553-692731 has an AUC of ~120, less than the 200 minimum required for a peak to be called. The following should result in a peak being called:

But the best solution would be to extend the reads to the average DNA fragment length (via -w <int>). Whoever prepared the DNA should know this value. This will boost the treatment pileup values and may result in a peak being called in this region, without altering -a or -p.