jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License
182 stars 27 forks source link

Single-end ChIP-seq: difference between running replicates together or separately #51

Closed Simso86 closed 4 years ago

Simso86 commented 4 years ago

I was running analysis on single-end ChIP-seq data and called peaks using Genrich (on BAM files aligned with Bowtie2). First I included my two replicates and obtained ~ 46.000 peaks. I have one control sample (IgG) which apparently I needed to state twice in the command (once for each analysed replicate).

While I usually use an AUC cut-off of 200.0 when I run Genrich for ATAC-seq data, I here instead use an AUC cut-off of 20.0, because of the expected narrower peaks of ChIP-seq. However, I'm still not sure what should be considered as a "good" AUC cut-off in either case. This might be the topic of another issue post.

Below is my code used:

Genrich -t Rep1_sorted.bam, Rep2_sorted.bam -c Control_sorted.bam,Control_sorted.bam -o PeaksCalled.narrowPeak -f PeaksCalled_pq.bdg -k PeaksCalled_pileup.bdg -b PeaksCalled.bed -r -y -q 0.05 -a 20.0 -v -e chrM -E /GenomeReferences/mm10-blacklist.v2.bed.gz 2>&1 | tee PeakCalling_Info.txt

I am piping the output to a file "PeakCalling_Info.txt" which afterwards contains the following:

Processing experimental file 0: Rep1_sorted.bam BAM records analyzed: 24222187 Unmapped: 2013547 To skipped refs: 43816 (chrM) Paired alignments: 0 Unpaired alignments: 22164824 PCR duplicates -- Paired aln sets: 0 duplicates: 0 (0.0%) Discordant aln sets: 0 duplicates: 0 (0.0%) Singleton aln sets: 22208640 duplicates: 12606377 (56.8%) Fragments analyzed: 9587789 Full fragments: 0 Half fragments: 9587789 (from unpaired alns) Processing control file 0: Control_sorted.bam BAM records analyzed: 21193969 Unmapped: 1647613 To skipped refs: 104501 (chrM) Paired alignments: 0 Unpaired alignments: 19441855 PCR duplicates -- Paired aln sets: 0 duplicates: 0 (0.0%) Discordant aln sets: 0 duplicates: 0 (0.0%) Singleton aln sets: 19546356 duplicates: 8141660 (41.7%) Fragments analyzed: 11378094 Full fragments: 0 Half fragments: 11378094 (from unpaired alns) Background pileup value: 0.180496 Scaling factor for control pileup: 0.844719 Processing experimental file 1: Rep2_sorted.bam BAM records analyzed: 24620207 Unmapped: 1283799 To skipped refs: 71918 (chrM) Paired alignments: 0 Unpaired alignments: 23264490 PCR duplicates -- Paired aln sets: 0 duplicates: 0 (0.0%) Discordant aln sets: 0 duplicates: 0 (0.0%) Singleton aln sets: 23336408 duplicates: 10862648 (46.5%) Fragments analyzed: 12452279 Full fragments: 0 Half fragments: 12452279 (from unpaired alns) Processing control file 1: Control_sorted.bam BAM records analyzed: 21193969 Unmapped: 1647613 To skipped refs: 104501 (chrM) Paired alignments: 0 Unpaired alignments: 19441855 PCR duplicates -- Paired aln sets: 0 duplicates: 0 (0.0%) Discordant aln sets: 0 duplicates: 0 (0.0%) Singleton aln sets: 19546356 duplicates: 8141660 (41.7%) Fragments analyzed: 11378094 Full fragments: 0 Half fragments: 11378094 (from unpaired alns) Background pileup value: 0.236650 Scaling factor for control pileup: 1.107519 Peak-calling parameters: Genome length: 2486544170bp Significance threshold: -log(q) > 1.301 Min. AUC: 20.000 Max. gap between sites: 100bp Peaks identified: 46418 (10476962bp)

As far as I have understood Genrich call peaks for the replicates separately and then combine their calculated p-values. But when I call peaks for each replicate myself (which I need in order to compare the replicates in certain ways) all q-values become 1 and no peaks are called for any of the replicates.

I used the following two lines of code:

Genrich -t Rep1_sorted.bam -c Control_sorted.bam -o Rep1.narrowPeak -f Rep1_pq.bdg -k Rep1_pileup.bdg -b Rep1.bed -r -y -q 0.05 -a 20.0 -v -e chrM -E /GenomeReferences/mm10-blacklist.v2.bed.gz 2>&1 | tee /Individual_samples/Rep1_PeakCalling_Info.txt

Genrich -t Rep2_sorted.bam -c Control_sorted.bam -o Rep2.narrowPeak -f Rep2_pq.bdg -k Rep2_pileup.bdg -b Rep2.bed -r -y -q 0.05 -a 20.0 -v -e chrM -E /GenomeReferences/mm10-blacklist.v2.bed.gz 2>&1 | tee /Individual_samples/Rep2_PeakCalling_Info.txt

I piped the output from each of these runs as before. Output for Rep1 is:

Processing experimental file 0: Rep1_sorted.bam BAM records analyzed: 24222187 Unmapped: 2013547 To skipped refs: 43816 (chrM) Paired alignments: 0 Unpaired alignments: 22164824 PCR duplicates -- Paired aln sets: 0 duplicates: 0 (0.0%) Discordant aln sets: 0 duplicates: 0 (0.0%) Singleton aln sets: 22208640 duplicates: 12606377 (56.8%) Fragments analyzed: 9587789 Full fragments: 0 Half fragments: 9587789 (from unpaired alns) Processing control file 0: Control_sorted.bam BAM records analyzed: 21193969 Unmapped: 1647613 To skipped refs: 104501 (chrM) Paired alignments: 0 Unpaired alignments: 19441855 PCR duplicates -- Paired aln sets: 0 duplicates: 0 (0.0%) Discordant aln sets: 0 duplicates: 0 (0.0%) Singleton aln sets: 19546356 duplicates: 8141660 (41.7%) Fragments analyzed: 11378094 Full fragments: 0 Half fragments: 11378094 (from unpaired alns) Background pileup value: 0.180496 Scaling factor for control pileup: 0.844719 Peak-calling parameters: Genome length: 2486544170bp Significance threshold: -log(q) > 1.301 Min. AUC: 20.000 Max. gap between sites: 100bp Warning! All q-values are 1 Peaks identified: 0 (0bp)

And for Rep2:

Processing experimental file 0: Rep2_sorted.bam BAM records analyzed: 24620207 Unmapped: 1283799 To skipped refs: 71918 (chrM) Paired alignments: 0 Unpaired alignments: 23264490 PCR duplicates -- Paired aln sets: 0 duplicates: 0 (0.0%) Discordant aln sets: 0 duplicates: 0 (0.0%) Singleton aln sets: 23336408 duplicates: 10862648 (46.5%) Fragments analyzed: 12452279 Full fragments: 0 Half fragments: 12452279 (from unpaired alns) Processing control file 0: Control_sorted.bam BAM records analyzed: 21193969 Unmapped: 1647613 To skipped refs: 104501 (chrM) Paired alignments: 0 Unpaired alignments: 19441855 PCR duplicates -- Paired aln sets: 0 duplicates: 0 (0.0%) Discordant aln sets: 0 duplicates: 0 (0.0%) Singleton aln sets: 19546356 duplicates: 8141660 (41.7%) Fragments analyzed: 11378094 Full fragments: 0 Half fragments: 11378094 (from unpaired alns) Background pileup value: 0.236650 Scaling factor for control pileup: 1.107519 Peak-calling parameters: Genome length: 2486544170bp Significance threshold: -log(q) > 1.301 Min. AUC: 20.000 Max. gap between sites: 100bp Warning! All q-values are 1 Peaks identified: 0 (0bp)

Looking at the output information about number of fragments analysed, background pileup value etc. there seem to be no differences between calling peaks on replicates together or separately. I used the exact same command line with the only difference being the number of replicates included. I have read the README but did not find anything that could explain my problem.

I have also tried to look at earlier posted issues, the one closest resembling my problem being #41. However, the answer there does not help me. I am using -y parameter and have given -q 0.05

So, what could be the explanation to why I get a lot of peaks when using the replicates together but no peaks when using them separately, even though I use the same settings?

jsh58 commented 4 years ago

I believe this is addressed in the README and #14.

Simso86 commented 4 years ago

John,

Thank you for your answer. The reason I explained my issue here was because I could not find an answer in the README. Neither does #14 fully address my problem.

The main question is: How is it even possible that running Genrich peakcalling on the replicates separately gives 0 peaks, while running the replicates together gives 46.000 peaks.

As you have pointed out in #14 and according top the README, Genrich computes p-values for each replicate and then combines them using Fisher's method. However, I can't really see how this can increase the significance to such a degree that the number of peaks goes from 0 to 46000. Therefore, I would suspect that something is wrong in my code or in the way I perform the peak calling.

I tested to run peak calling on each of the replicates with MACS2, using a threshold of q<0.05. This gave about 15.000 significant peaks for each.

jsh58 commented 4 years ago

Understanding why Genrich calls the peaks it does for a sample (or group of samples) can be gained by examining a -f log file. See, for example, #23. In that case, 0 peaks were called with a q-value threshold, due to p-values not achieving a low enough value to overcome Benjamini-Hochberg correction.

Please also read the description of the All q-values are 1 warning message here.

Simso86 commented 4 years ago

Thanks. I have consulted #23 and the description of the warning message in README. I understand that 0 peaks are called (even if setting q < 1) because no p-values are low enough to survive Benjamini-Hochberg correction. That is not the main issue. The question is why the p-values are so high in the first place. The samples have been analysed before, always yielding significant peaks (q < 0.05) for each of the replicates.

Calling peaks for the same samples using MACS2 gives about 15.000 peaks that survives Benjamini-Hochberg correction, which is further reduced to just 2000 peaks after IDR correction. However, Genrich identifies no peaks for either of the samples as surviving Benjamini-Hochberg correction (all q < 1), but it reports 46.000 peaks when running the replicates together.

So in one way, Genrich seem to be much more stricter than MACS2 when calling peaks for each of the replicates separately. But on the other hand, when running replicates together it increases the number of significant peaks a lot (as opposed to IDR).

jsh58 commented 4 years ago

The calculation of p-values by Genrich is explained in the README, and there is also some extended discussion in #11.

The reason why MACS2 has lower p-values than Genrich, in general, is that MACS2 uses a fundamentally different null model, based on the Poisson distribution. As I have written before:

the Poisson distribution [...] is frequently used as a null model in genomics software but was a good fit to the control pileup distribution never.

The reason why multiple replicates can lead to more peaks with Genrich has been addressed.