jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License
182 stars 27 forks source link

Expectations when excluding blacklist regions (-E) #23

Closed pgugger closed 4 years ago

pgugger commented 4 years ago

Hi, I am running Genrich on a test sample and get no peaks when excluding the ENCODE mouse blacklist regions using the -E mm10-blacklist.v2.bed.gz argument. However, when I do not use this argument, I get many peaks, only a small subset of which overlap with the blacklist regions. Please see below for details. Is this expected behavior, and if so can you explain why? Thank you for your help!

Run with -E mm10-blacklist.v2.bed.gz to exclude blacklist regions results in no peaks:

$ Genrich -t sample.bam -j -e chrM -E mm10-blacklist.v2.bed.gz -m 30 -q 0.05 -f sample.pq.bedGraph -k sample.pileup.bedGraph -o sample.narrowPeak -v
Processing experimental file #0: sample.bam
  BAM records analyzed:   16256998
    Paired alignments:    16256998
    Unpaired alignments:         0
  Fragments analyzed:      8128499
    Full fragments:        8128499
    ATAC-seq cut sites:   16256998
      (expanded to length 100bp)
- control file #0 not provided -
  Background pileup value: 0.585848
Peak-calling parameters:
  Genome length: 2486544170bp
  Significance threshold: -log(q) > 1.301
  Min. AUC: 20.000
  Max. gap between sites: 100bp
Peaks identified: 0 (0bp)

Run without -E mm10-blacklist.v2.bed.gz, thus leaving in blacklist regions, results in 6056 peaks:

$ Genrich -t sample.bam -j -e chrM -m 30 -q 0.05 -f sample.run2.pq.bedGraph -k sample.run2.pileup.bedGraph -o sample.run2.narrowPeak -v
Processing experimental file #0: sample.bam
  BAM records analyzed:   16256998
    Paired alignments:    16256998
    Unpaired alignments:         0
  Fragments analyzed:      8128499
    Full fragments:        8128499
    ATAC-seq cut sites:   16256998
      (expanded to length 100bp)
- control file #0 not provided -
  Background pileup value: 0.544793
Peak-calling parameters:
  Genome length: 2725521370bp
  Significance threshold: -log(q) > 1.301
  Min. AUC: 20.000
  Max. gap between sites: 100bp
Peaks identified: 6056 (1521480bp)

Number of tracks after subtracting blacklist regions from Genrich run without -E argument:

$ bedtools subtract -a sample.run2.narrowPeak -b mm10-blacklist.v2.bed.gz | wc -l
5842

Number of base pairs in tracks after subtracting blacklist regions from Genrich run without -E argument:

$ bedtools subtract -a sample.run2.narrowPeak -b /mm10-blacklist.v2.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
1414055

As you can see, most peaks do not overlap with blacklisted regions.

jsh58 commented 4 years ago

Thanks for the detailed question.

Do those peaks survive if you add -A to the bedtools subtract command? If not, that is because (according to the README):

All alignments, or portions of alignments, that lie within the given genomic regions are excluded from peak-calling, and no peak may extend into or around an excluded region.

If the peaks do survive, please pick a peak and post the portions of the two -f log files corresponding to the vicinity of that peak.

pgugger commented 4 years ago

Thanks for you response. Yes, they mostly do survive:

$ bedtools subtract -A -a sample.run2.narrowPeak -b mm10-blacklist.v2.bed.gz | wc -l
5831

$ bedtools subtract -A -a sample.run2.narrowPeak -b mm10-blacklist.v2.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
1413020

Here is a portion of the f log from sample.pq.bedGraph (run with -E mm10-blacklist.v2.bed.gz):

chr10   3697450 3697458 1.000000        0.585848        0.825216        0.000000
chr10   3697458 3697461 2.000000        0.585848        1.418231        0.000000
chr10   3697461 3697471 3.000000        0.585848        1.858640        0.000000
chr10   3697471 3697472 5.000000        0.585848        2.515833        0.061369
chr10   3697472 3697474 6.000000        0.585848        2.778574        0.214014
chr10   3697474 3697478 7.000000        0.585848        3.012423        0.354862
chr10   3697478 3697482 9.000000        0.585848        3.416824        0.598978
chr10   3697482 3697488 8.000000        0.585848        3.223709        0.482652
chr10   3697488 3697498 9.000000        0.585848        3.416824        0.598978
chr10   3697498 3697516 10.000000       0.585848        3.594950        0.705523
chr10   3697516 3697521 11.000000       0.585848        3.760472        0.804364
chr10   3697521 3697549 12.000000       0.585848        3.915230        0.877875
chr10   3697549 3697551 13.000000       0.585848        4.060675        0.877875
chr10   3697551 3697558 14.000000       0.585848        4.197975        0.877875
chr10   3697558 3697559 13.000000       0.585848        4.060675        0.877875
chr10   3697559 3697561 16.000000       0.585848        4.451785        0.877875
chr10   3697561 3697562 17.000000       0.585848        4.569746        0.877875
chr10   3697562 3697565 18.000000       0.585848        4.682526        0.877875
chr10   3697565 3697569 19.000000       0.585848        4.790605        0.877875
chr10   3697569 3697571 20.000000       0.585848        4.894397        0.877875
chr10   3697571 3697572 18.000000       0.585848        4.682526        0.877875
chr10   3697572 3697574 17.000000       0.585848        4.569746        0.877875
chr10   3697574 3697575 16.000000       0.585848        4.451785        0.877875
chr10   3697575 3697576 18.000000       0.585848        4.682526        0.877875
chr10   3697576 3697578 19.000000       0.585848        4.790605        0.877875
chr10   3697578 3697585 18.000000       0.585848        4.682526        0.877875
chr10   3697585 3697586 19.000000       0.585848        4.790605        0.877875
chr10   3697586 3697588 20.000000       0.585848        4.894397        0.877875
chr10   3697588 3697592 19.000000       0.585848        4.790605        0.877875
chr10   3697592 3697599 20.000000       0.585848        4.894397        0.877875
chr10   3697599 3697600 22.000000       0.585848        5.090512        0.877875
chr10   3697600 3697612 23.000000       0.585848        5.183427        0.877875
chr10   3697612 3697616 24.000000       0.585848        5.273252        0.877875
chr10   3697616 3697618 23.000000       0.585848        5.183427        0.877875
chr10   3697618 3697621 24.000000       0.585848        5.273252        0.877875
chr10   3697621 3697622 23.000000       0.585848        5.183427        0.877875
chr10   3697622 3697636 25.000000       0.585848        5.360206        0.877875
chr10   3697636 3697644 26.000000       0.585848        5.444482        0.877875
chr10   3697644 3697649 27.000000       0.585848        5.526257        0.877875
chr10   3697649 3697651 26.000000       0.585848        5.444482        0.877875
chr10   3697651 3697659 25.000000       0.585848        5.360206        0.877875
chr10   3697659 3697661 23.000000       0.585848        5.183427        0.877875
chr10   3697661 3697662 22.000000       0.585848        5.090512        0.877875
chr10   3697662 3697664 21.000000       0.585848        4.994261        0.877875
chr10   3697664 3697665 22.000000       0.585848        5.090512        0.877875
chr10   3697665 3697666 21.000000       0.585848        4.994261        0.877875
chr10   3697666 3697668 22.000000       0.585848        5.090512        0.877875
chr10   3697668 3697669 23.000000       0.585848        5.183427        0.877875
chr10   3697669 3697679 22.000000       0.585848        5.090512        0.877875
chr10   3697679 3697680 23.000000       0.585848        5.183427        0.877875
chr10   3697680 3697681 27.000000       0.585848        5.526257        0.877875
chr10   3697681 3697683 28.000000       0.585848        5.605688        0.877875
chr10   3697683 3697684 29.000000       0.585848        5.682919        0.877875
chr10   3697684 3697685 30.000000       0.585848        5.758079        0.877875
chr10   3697685 3697686 29.000000       0.585848        5.682919        0.877875
chr10   3697686 3697690 28.000000       0.585848        5.605688        0.877875
chr10   3697690 3697691 29.000000       0.585848        5.682919        0.877875
chr10   3697691 3697693 30.000000       0.585848        5.758079        0.877875
chr10   3697693 3697694 29.000000       0.585848        5.682919        0.877875
chr10   3697694 3697696 30.000000       0.585848        5.758079        0.877875
chr10   3697696 3697698 29.000000       0.585848        5.682919        0.877875
chr10   3697698 3697699 28.000000       0.585848        5.605688        0.877875
chr10   3697699 3697720 26.000000       0.585848        5.444482        0.877875
chr10   3697720 3697739 27.000000       0.585848        5.526257        0.877875
chr10   3697739 3697749 28.000000       0.585848        5.605688        0.877875
chr10   3697749 3697750 29.000000       0.585848        5.682919        0.877875
chr10   3697750 3697756 28.000000       0.585848        5.605688        0.877875
chr10   3697756 3697758 29.000000       0.585848        5.682919        0.877875
chr10   3697758 3697759 30.000000       0.585848        5.758079        0.877875
chr10   3697759 3697761 31.000000       0.585848        5.831288        0.877875
chr10   3697761 3697763 32.000000       0.585848        5.902652        0.877875
chr10   3697763 3697765 33.000000       0.585848        5.972272        0.877875
chr10   3697765 3697766 32.000000       0.585848        5.902652        0.877875
chr10   3697766 3697768 30.000000       0.585848        5.758079        0.877875
chr10   3697768 3697770 26.000000       0.585848        5.444482        0.877875
chr10   3697770 3697775 25.000000       0.585848        5.360206        0.877875
chr10   3697775 3697776 24.000000       0.585848        5.273252        0.877875
chr10   3697776 3697780 23.000000       0.585848        5.183427        0.877875
chr10   3697780 3697783 18.000000       0.585848        4.682526        0.877875
chr10   3697783 3697784 17.000000       0.585848        4.569746        0.877875
chr10   3697784 3697790 16.000000       0.585848        4.451785        0.877875
chr10   3697790 3697791 15.000000       0.585848        4.328082        0.877875
chr10   3697791 3697794 14.000000       0.585848        4.197975        0.877875
chr10   3697794 3697798 13.000000       0.585848        4.060675        0.877875
chr10   3697798 3697799 14.000000       0.585848        4.197975        0.877875
chr10   3697799 3697806 13.000000       0.585848        4.060675        0.877875
chr10   3697806 3697814 14.000000       0.585848        4.197975        0.877875
chr10   3697814 3697839 12.000000       0.585848        3.915230        0.877875
chr10   3697839 3697849 11.000000       0.585848        3.760472        0.804364
chr10   3697849 3697855 10.000000       0.585848        3.594950        0.705523
chr10   3697855 3697856 11.000000       0.585848        3.760472        0.804364
chr10   3697856 3697858 9.000000        0.585848        3.416824        0.598978
chr10   3697858 3697861 7.000000        0.585848        3.012423        0.354862
chr10   3697861 3697862 6.000000        0.585848        2.778574        0.214014
chr10   3697862 3697863 7.000000        0.585848        3.012423        0.354862
chr10   3697863 3697877 5.000000        0.585848        2.515833        0.061369
chr10   3697877 3697891 6.000000        0.585848        2.778574        0.214014
chr10   3697891 3697898 5.000000        0.585848        2.515833        0.061369
chr10   3697898 3697906 4.000000        0.585848        2.214529        0.000000
chr10   3697906 3697955 3.000000        0.585848        1.858640        0.000000
chr10   3697955 3697962 2.000000        0.585848        1.418231        0.000000
chr10   3697962 3697977 1.000000        0.585848        0.825216        0.000000
chr10   3697977 3698117 0.000000        0.585848        0.000000        0.000000

Here is a portion of the -f log from sample.run2.pq.bedGraph (run without -E mm10-blacklist.v2.bed.gz):

chr10   3697450 3697458 1.000000        0.544793        0.878237        0.000000
chr10   3697458 3697461 2.000000        0.544793        1.491956        0.000000
chr10   3697461 3697471 3.000000        0.544793        1.945085        0.000000
chr10   3697471 3697472 5.000000        0.544793        2.618742        0.139011
chr10   3697472 3697474 6.000000        0.544793        2.887451        0.298725
chr10   3697474 3697478 7.000000        0.544793        3.126377        0.445712
chr10   3697478 3697482 9.000000        0.544793        3.539111        0.700682
chr10   3697482 3697488 8.000000        0.544793        3.342083        0.579165
chr10   3697488 3697498 9.000000        0.544793        3.539111        0.700682
chr10   3697498 3697516 10.000000       0.544793        3.720748        0.812199
chr10   3697516 3697521 11.000000       0.544793        3.889454        0.915637
chr10   3697521 3697549 12.000000       0.544793        4.047126        1.011692
chr10   3697549 3697551 13.000000       0.544793        4.195258        1.100782
chr10   3697551 3697558 14.000000       0.544793        4.335048        1.183910
chr10   3697558 3697559 13.000000       0.544793        4.195258        1.100782
chr10   3697559 3697561 16.000000       0.544793        4.593356        1.335905        *
chr10   3697561 3697562 17.000000       0.544793        4.713363        1.405449        *
chr10   3697562 3697565 18.000000       0.544793        4.828074        1.471178        *
chr10   3697565 3697569 19.000000       0.544793        4.937982        1.532959        *
chr10   3697569 3697571 20.000000       0.544793        5.043510        1.591760        *
chr10   3697571 3697572 18.000000       0.544793        4.828074        1.471178        *
chr10   3697572 3697574 17.000000       0.544793        4.713363        1.405449        *
chr10   3697574 3697575 16.000000       0.544793        4.593356        1.335905        *
chr10   3697575 3697576 18.000000       0.544793        4.828074        1.471178        *
chr10   3697576 3697578 19.000000       0.544793        4.937982        1.532959        *
chr10   3697578 3697585 18.000000       0.544793        4.828074        1.471178        *
chr10   3697585 3697586 19.000000       0.544793        4.937982        1.532959        *
chr10   3697586 3697588 20.000000       0.544793        5.043510        1.591760        *
chr10   3697588 3697592 19.000000       0.544793        4.937982        1.532959        *
chr10   3697592 3697599 20.000000       0.544793        5.043510        1.591760        *
chr10   3697599 3697600 22.000000       0.544793        5.242857        1.702696        *
chr10   3697600 3697612 23.000000       0.544793        5.337281        1.754841        *
chr10   3697612 3697616 24.000000       0.544793        5.428551        1.805139        *
chr10   3697616 3697618 23.000000       0.544793        5.337281        1.754841        *
chr10   3697618 3697621 24.000000       0.544793        5.428551        1.805139        *
chr10   3697621 3697622 23.000000       0.544793        5.337281        1.754841        *
chr10   3697622 3697636 25.000000       0.544793        5.516891        1.852799        *
chr10   3697636 3697644 26.000000       0.544793        5.602501        1.898507        *
chr10   3697644 3697649 27.000000       0.544793        5.685560        1.942111        *
chr10   3697649 3697651 26.000000       0.544793        5.602501        1.898507        *
chr10   3697651 3697659 25.000000       0.544793        5.516891        1.852799        *
chr10   3697659 3697661 23.000000       0.544793        5.337281        1.754841        *
chr10   3697661 3697662 22.000000       0.544793        5.242857        1.702696        *
chr10   3697662 3697664 21.000000       0.544793        5.145028        1.648472        *
chr10   3697664 3697665 22.000000       0.544793        5.242857        1.702696        *
chr10   3697665 3697666 21.000000       0.544793        5.145028        1.648472        *
chr10   3697666 3697668 22.000000       0.544793        5.242857        1.702696        *
chr10   3697668 3697669 23.000000       0.544793        5.337281        1.754841        *
chr10   3697669 3697679 22.000000       0.544793        5.242857        1.702696        *
chr10   3697679 3697680 23.000000       0.544793        5.337281        1.754841        *
chr10   3697680 3697681 27.000000       0.544793        5.685560        1.942111        *
chr10   3697681 3697683 28.000000       0.544793        5.766229        1.984694        *
chr10   3697683 3697684 29.000000       0.544793        5.844656        2.026038        *
chr10   3697684 3697685 30.000000       0.544793        5.920971        2.065376        *
chr10   3697685 3697686 29.000000       0.544793        5.844656        2.026038        *
chr10   3697686 3697690 28.000000       0.544793        5.766229        1.984694        *
chr10   3697690 3697691 29.000000       0.544793        5.844656        2.026038        *
chr10   3697691 3697693 30.000000       0.544793        5.920971        2.065376        *
chr10   3697693 3697694 29.000000       0.544793        5.844656        2.026038        *
chr10   3697694 3697696 30.000000       0.544793        5.920971        2.065376        *
chr10   3697696 3697698 29.000000       0.544793        5.844656        2.026038        *
chr10   3697698 3697699 28.000000       0.544793        5.766229        1.984694        *
chr10   3697699 3697720 26.000000       0.544793        5.602501        1.898507        *
chr10   3697720 3697739 27.000000       0.544793        5.685560        1.942111        *
chr10   3697739 3697749 28.000000       0.544793        5.766229        1.984694        *
chr10   3697749 3697750 29.000000       0.544793        5.844656        2.026038        *
chr10   3697750 3697756 28.000000       0.544793        5.766229        1.984694        *
chr10   3697756 3697758 29.000000       0.544793        5.844656        2.026038        *
chr10   3697758 3697759 30.000000       0.544793        5.920971        2.065376        *
chr10   3697759 3697761 31.000000       0.544793        5.995297        2.102487        *
chr10   3697761 3697763 32.000000       0.544793        6.067745        2.139020        *
chr10   3697763 3697765 33.000000       0.544793        6.138415        2.174778        *
chr10   3697765 3697766 32.000000       0.544793        6.067745        2.139020        *
chr10   3697766 3697768 30.000000       0.544793        5.920971        2.065376        *
chr10   3697768 3697770 26.000000       0.544793        5.602501        1.898507        *
chr10   3697770 3697775 25.000000       0.544793        5.516891        1.852799        *
chr10   3697775 3697776 24.000000       0.544793        5.428551        1.805139        *
chr10   3697776 3697780 23.000000       0.544793        5.337281        1.754841        *
chr10   3697780 3697783 18.000000       0.544793        4.828074        1.471178        *
chr10   3697783 3697784 17.000000       0.544793        4.713363        1.405449        *
chr10   3697784 3697790 16.000000       0.544793        4.593356        1.335905        *
chr10   3697790 3697791 15.000000       0.544793        4.467477        1.262177
chr10   3697791 3697794 14.000000       0.544793        4.335048        1.183910
chr10   3697794 3697798 13.000000       0.544793        4.195258        1.100782
chr10   3697798 3697799 14.000000       0.544793        4.335048        1.183910
chr10   3697799 3697806 13.000000       0.544793        4.195258        1.100782
chr10   3697806 3697814 14.000000       0.544793        4.335048        1.183910
chr10   3697814 3697839 12.000000       0.544793        4.047126        1.011692
chr10   3697839 3697849 11.000000       0.544793        3.889454        0.915637
chr10   3697849 3697855 10.000000       0.544793        3.720748        0.812199
chr10   3697855 3697856 11.000000       0.544793        3.889454        0.915637
chr10   3697856 3697858 9.000000        0.544793        3.539111        0.700682
chr10   3697858 3697861 7.000000        0.544793        3.126377        0.445712
chr10   3697861 3697862 6.000000        0.544793        2.887451        0.298725
chr10   3697862 3697863 7.000000        0.544793        3.126377        0.445712
chr10   3697863 3697877 5.000000        0.544793        2.618742        0.139011
chr10   3697877 3697891 6.000000        0.544793        2.887451        0.298725
chr10   3697891 3697898 5.000000        0.544793        2.618742        0.139011
chr10   3697898 3697906 4.000000        0.544793        2.310195        0.000000
chr10   3697906 3697955 3.000000        0.544793        1.945085        0.000000
chr10   3697955 3697962 2.000000        0.544793        1.491956        0.000000
chr10   3697962 3697977 1.000000        0.544793        0.878237        0.000000
chr10   3697977 3698117 0.000000        0.544793        0.000000        0.000000
jsh58 commented 4 years ago

Excellent, thanks! That's very helpful.

It appears that the pileup and p-values are similar in the two cases, but that the q-values are much lower when you include the blacklist. I would recommend that you use a p-value threshold for peak-calling, e.g. -p0.01 -a200. In fact, this is what I am changing the default parameters of Genrich to, after some discussion with my colleagues (@tsackton, @crazyhottommy).

Still, it would be good to know if the p->q calculation of Genrich is flawed. Could you please post the result of the following for your first analysis (with -E):

awk '{if ($7 > 0.877875) sum += $3 - $2; if ($7 == 0.877875 && $6 > max) max = $6} END {print sum, max}' sample.pq.bedGraph
pgugger commented 4 years ago

Ok, thanks! I will re-run with those parameter values. Here is the result you asked for:

$ awk '{if ($7 > 0.877875) sum += $3 - $2; if ($7 == 0.877875 && $6 > max) max = $6} END {print sum, max}' sample.pq.bedGraph
238977200 10.273472
jsh58 commented 4 years ago

Hmm, that doesn't look right. Are you sure you ran this on the bedgraph from your first analysis with -E?

pgugger commented 4 years ago

Yes, I just double-checked, rerunning everything to be sure. Here is the same result for the run without -E for comparison:

$ awk '{if ($7 > 0.877875) sum += $3 - $2; if ($7 == 0.877875 && $6 > max) max = $6} END {print sum, max}' sample.run2.pq.bedGraph
3364341 
jsh58 commented 4 years ago

Thanks, I do appreciate that. I think (and I hope) that I made a mistake with the awk command. If you don't mind, can you please run this:

$ awk '{if ($7 != "NA" && $7 > 0.877875) sum += $3 - $2; if ($7 == 0.877875 && $6 > max) max = $6} END {print sum, max}' sample.pq.bedGraph
pgugger commented 4 years ago
$ awk '{if ($7 != "NA" && $7 > 0.877875) sum += $3 - $2; if ($7 == 0.877875 && $6 > max) max = $6} END {print sum, max}' sample.pq.bedGraph
 10.273472
jsh58 commented 4 years ago

Great, thank you very much! The program is working correctly. I still recommend -p0.01 -a200, and those are the new defaults in the program.

Also, on a related note: your reference genome, mm10, contains long stretches of Ns, and Genrich thinks that they are part of the assay. To exclude them, you need to give Genrich a BED file of N homopolymers, which can be easily generated by findNs.py. Multiple files, such as the blacklist and the N homopolymer file, can be given to Genrich by -E, as described here. A full analysis example can be found here.

pgugger commented 4 years ago

Thanks! I actually hadn't noticed findNs.py, but will now add that to my analysis.