macs3-project / MACS

MACS -- Model-based Analysis of ChIP-Seq
https://macs3-project.github.io/MACS/
BSD 3-Clause "New" or "Revised" License
702 stars 268 forks source link

Q: -log10(qvalue) are 0 in output files #409

Open MalindaH opened 4 years ago

MalindaH commented 4 years ago

Dear Dr. Liu,

I am trying to use MACS2 on human double-strand breaks files.

When I set the cutoff pvalue to 0.05 (in test14), I could get a lot of peaks, but the -log10(qvalue) column in the output files are almost all 0. Here is part of test14_peak.xls:

# This file is   generated by MACS version 2.2.7.1
# Command line: callpeak -t  bowtie-output.bam -c bowtie-outputc.bam -f BAM -g hs -p 0.05 -B -n test14  --outdir test1/test14
# ARGUMENTS LIST:
# name = test14
# format = BAM
# ChIP-seq file = ['bowtie-output.bam']
# control file = ['bowtie-outputc.bam']
# effective genome size = 2.70e+09
# band width = 300
# model fold = [5, 50]
# pvalue cutoff = 5.00e-02
# qvalue will not be calculated and   reported as -1 in the final output.
# The maximum gap between significant   sites is assigned as the read length/tag size.
# The minimum length of peaks is assigned   as the predicted fragment length "d".
# Larger dataset will be scaled towards   smaller dataset.
# Range for calculating regional lambda   is: 1000 bps and 10000 bps
# Broad region calling is off
# Paired-End mode is off
 
# tag size is determined as 38 bps
# total tags in treatment: 5709090
# tags after filtering in treatment:   5050586
# maximum duplicate tags at the same   position in treatment = 1
# Redundant rate in treatment: 0.12
# total tags in control: 1643714
# tags after filtering in control:   1403958
# maximum duplicate tags at the same   position in control = 1
# Redundant rate in control: 0.15
# d = 298
chr | start | end | length | abs_summit | pileup | -log10(pvalue) | fold_enrichment | -log10(qvalue) | name
chr12 | 1.1E+08 | 1.1E+08 | 298 | 1.1E+08 | 1.67 | 1.96519 | 2.30994 | 0 | test14_peak_1139
chr12 | 1.11E+08 | 1.11E+08 | 298 | 1.11E+08 | 1.95 | 1.96519 | 2.55062 | 0 | test14_peak_1140
chr12 | 1.14E+08 | 1.14E+08 | 298 | 1.14E+08 | 1.39 | 1.96519 | 2.06925 | 0 | test14_peak_1141
chr12 | 1.19E+08 | 1.19E+08 | 298 | 1.19E+08 | 1.67 | 1.96519 | 2.30994 | 0 | test14_peak_1142
chr12 | 1.25E+08 | 1.25E+08 | 298 | 1.25E+08 | 1.95 | 1.96519 | 2.55062 | 0 | test14_peak_1143
chr12 | 1.3E+08 | 1.3E+08 | 305 | 1.3E+08 | 1.67 | 1.52089 | 2.10367 | 0 | test14_peak_1144
chr12 | 1.32E+08 | 1.32E+08 | 298 | 1.32E+08 | 1.67 | 1.43779 | 2.05537 | 0 | test14_peak_1145
chr12 | 1.33E+08 | 1.33E+08 | 341 | 1.33E+08 | 1.67 | 1.96519 | 2.30994 | 0 | test14_peak_1146
chr12 | 1.34E+08 | 1.34E+08 | 451 | 1.34E+08 | 48.37 | 40.6812 | 12.3421 | 32.5313 | test14_peak_1147
chr13 | 30027344 | 30027647 | 304 | 30027572 | 1.95 | 1.96519 | 2.55062 | 0 | test14_peak_1148
chr13 | 32683817 | 32684214 | 398 | 32683944 | 1.67 | 1.96519 | 2.30994 | 0 | test14_peak_1149
chr13 | 36372288 | 36372585 | 298 | 36372386 | 1.11 | 1.96519 | 1.82857 | 0 | test14_peak_1150
chr13 | 49105674 | 49105971 | 298 | 49105736 | 1.67 | 1.96519 | 2.30994 | 0 | test14_peak_1151
chr13 | 59508035 | 59508376 | 342 | 59508164 | 1.67 | 1.96519 | 2.30994 | 0 | test14_peak_1152
chr13 | 59936998 | 59937295 | 298 | 59937179 | 1.95 | 1.96519 | 2.55062 | 0 | test14_peak_1153
chr13 | 60828276 | 60828711 | 436 | 60828432 | 2.22 | 3.25781 | 2.79131 | 0 | test14_peak_1154
chr13 | 60839184 | 60839481 | 298 | 60839343 | 1.67 | 1.96519 | 2.30994 | 0 | test14_peak_1155
chr13 | 61727987 | 61728284 | 298 | 61728133 | 1.95 | 1.96519 | 2.55062 | 0 | test14_peak_1156
chr13 | 62251916 | 62252213 | 298 | 62252062 | 1.39 | 1.43779 | 1.84121 | 0 | test14_peak_1157
chr13 | 62868519 | 62868816 | 298 | 62868671 | 1.95 | 1.96519 | 2.55062 | 0 | test14_peak_1158
chr13 | 64825181 | 64825478 | 298 | 64825418 | 1.67 | 1.96519 | 2.30994 | 0 | test14_peak_1159
chr13 | 65697225 | 65697589 | 365 | 65697366 | 2.5 | 3.25781 | 3.03199 | 0 | test14_peak_1160

When I set the cutoff qvalue to 0.05 (in test15), I was only able to get one peak. Here is test15_peak.xls:

# This file is   generated by MACS version 2.2.7.1
# Command line: callpeak -t   bowtie-output.bam -c bowtie-outputc.bam -f BAM -g hs -q 0.05 -B -n test15   --outdir test1/test15
# ARGUMENTS LIST:
# name = test15
# format = BAM
# ChIP-seq file = ['bowtie-output.bam']
# control file = ['bowtie-outputc.bam']
# effective genome size = 2.70e+09
# band width = 300
# model fold = [5, 50]
# qvalue cutoff = 5.00e-02
# The maximum gap between significant   sites is assigned as the read length/tag size.
# The minimum length of peaks is assigned   as the predicted fragment length "d".
# Larger dataset will be scaled towards   smaller dataset.
# Range for calculating regional lambda   is: 1000 bps and 10000 bps
# Broad region calling is off
# Paired-End mode is off
 
# tag size is determined as 38 bps
# total tags in treatment: 5709090
# tags after filtering in treatment:   5050586
# maximum duplicate tags at the same   position in treatment = 1
# Redundant rate in treatment: 0.12
# total tags in control: 1643714
# tags after filtering in control:   1403958
# maximum duplicate tags at the same   position in control = 1
# Redundant rate in control: 0.15
# d = 298
# alternative fragment length(s) may be   83,258,298 bps
chr | start | end | length | abs_summit | pileup | -log10(pvalue) | fold_enrichment | -log10(qvalue) | name
chr12 | 1.34E+08 | 1.34E+08 | 382 | 1.34E+08 | 48.37 | 40.6812 | 12.3421 | 32.5313 | test15_peak_1

It seems that it's not calculating the qvalues correctly because the qvalue cutoff is not really working. Could you take a look and point out what have I done wrong?

Thank you!

System:

MalindaH commented 4 years ago

The sequencing data that i'm using are double-strand break sequences. Does that have an impact in this output?

taoliu commented 4 years ago

@MalindaH q-value is more conservative than p-value. You can see that in your first result, most of the peaks have -log10 p-value of 1.9 or 3 which means they barely passed the cutoff of 0.05. I suspect that those weak peaks are noises in the system. You can ask MACS to generate signal files (-B) in bedGraph format, load them to genome browser such as IGV or UCSC genome browser to see if those peaks look real. Another option is to run macs callpeak --cutoff-analysis then you will see how the total number of peaks change with cutoff value, which will help you decide a good cutoff. You will have an insight that what extent you lower the cutoff, the peak caller starts to capture noises (a sudden increase of peak number).