jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License
182 stars 27 forks source link

Peak detection when using multiple input BAM files #14

Closed ppnin closed 5 years ago

ppnin commented 5 years ago

Hi,

Thank you very much for developing Genrich, I use it with ATAC-seq data and it's been very helpful!

I have a question regarding peak calling with multiple input BAM files (biological replicates), when Genrich calculates combined p-values. Assume that we have 2 replicates as a treatment input (without control) and Genrich detects a peak on similar intervals in both BAM files. If the combined p-value indicates statistical significance - which peak interval is reported in the resulting narrowPeak file?

For example, this is the narrowPeak file that I get when I use Genrich only with the first BAM,

chr1    564427  564463  peak_0  1000    .   55.448097   7.310349    3.789625    35
chr1    713893  714248  peak_1  968 .   343.498718  6.771811    3.445047    183
chr1    935398  935612  peak_2  379 .   81.118309   5.002090    2.242451    131
chr1    1051502 1051812 peak_3  377 .   116.836372  5.263575    2.423428    253
chr1    1120695 1120890 peak_4  860 .   167.778976  5.502086    2.587958    109
...

and this is the output when only the second BAM is used:

chr1    237697  237827  peak_0  496 .   64.472443   4.489713    1.942851    64
chr1    713928  714367  peak_1  1000    .   581.454346  7.463357    4.000597    221
chr1    840132  840232  peak_2  555 .   55.489933   4.794406    2.159927    52
chr1    935325  935687  peak_3  777 .   281.245453  5.751664    2.839107    266
chr1    935881  935981  peak_4  408 .   40.818642   4.489713    1.942851    42
...

When I use them together as a treatment input the result is

chr1    20094   20191   peak_0  558 .   54.145229   3.849536    1.887959    53
chr1    237688  237829  peak_1  1000    .   256.377411  5.951506    3.625094    72
chr1    521521  521608  peak_2  308 .   26.816595   3.482872    1.609267    43
chr1    564418  564463  peak_3  1000    .   177.645432  11.178493   8.029524    44
chr1    713872  714360  peak_4  1000    .   2027.053711 12.480982   9.106486    222
...

Looking at the peak coordinates, I see that peak_4 in the third narrowPeak should correspond to the peak_1 in the first and second narrowPeak file, while peak_0 is detected on coordinates where we don't have a peak when using separate BAM files.

In my understanding (section “Peak-calling method” in the documentation) Genrich first calculates AUC for statistically significant regions, and then defines peaks as regions whose AUC is above a threshold. How is AUC calculated when using combined p-values?

Thanks!

Pavle

jsh58 commented 5 years ago

Pavle,

Thanks for the question.

Assume that we have 2 replicates as a treatment input (without control) and Genrich detects a peak on similar intervals in both BAM files.

Let me stop you there. With multiple input files, Genrich does not call peaks from the files separately. Instead, it computes p-values for each and then combines them, as described here.

A log file (-f) should clarify why peak_0 and the other peaks were called when analyzing the replicates together.

John Gaspar

ppnin commented 5 years ago

John,

Yes, thanks for pointing that out - checking the documentation again has cleared up my confusion.

Pavle