jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License
182 stars 27 forks source link

understanding origin of extremely broad peaks #79

Closed malcook closed 2 years ago

malcook commented 2 years ago

I am at a loss to explain the origin of some selected extremely broad peaks called by genrich in area where none exists, at least by my visual inspection.

For example, peak 793 in track H3K27ac_LL_N_000.narrowPeak as pictured

image

Other peaks outside of this region on this (and other) chromosomes seem quite reasonable, e.g.

image

They were called by Genrich using commands (NB: repeating the single input control three times to go with the three ChIP tracks as I understand to be required)

mkdir peal/{pq,a0_q.05_g500_l100}

Genrich \
-t 'ali/H3K27ac_LL_N_0000.3.n.bam ali/H3K27ac_LL_N_0000.1.n.bam ali/H3K27ac_LL_N_0000.4.n.bam' \
-c 'ali/Input_LL_N_0000.1.n.bam ali/Input_LL_N_0000.1.n.bam ali/Input_LL_N_0000.1.n.bam' \
-f peak/pq/H3K27ac_LL_N_0000.narrowPeak.pq.bedgraph \
-r \
-y \
-E blacklist/blacklist.bed \
-e 'chrM' \
-X \
-q .99 \
-v \
&> peak/pq/H3K27ac_LL_N_0000.narrowPeak.pq.bedgraph.stderrout

Genrich -a 0 -q .05 -g 500 -l 100   -P -o peak/a0_q.05_g500_l100/H3K27ac_LL_N_0000.narrowPeak -f peak/pq/H3K27ac_LL_N_0000.narrowPeak.pq.bedgraph -v &> peak/a0_q.05_g500_l100/H3K27ac_LL_N_0000.narrowPeak.stderrout

Can you advise how to best confirm the Genrich is performing (or not) as documented/expected?

I can make any of these files available to you via ftp if it would you help us, or I can perform any diagnostics you might suggest.

jsh58 commented 2 years ago

Thanks for the question and the visualizations. You can check Genrich's calculations via the -f log file, and visualize the tracks like Figure 1 of the README. I suspect that your peak-calling parameters are leading to the broad peaks: with -a 0, any genomic position that achieves significance will be counted, and with -g 500, any significant genomic positions within 500bp of each other will be linked together into a peak.

malcook commented 2 years ago

While seeking to visualize the track as suggested I find that some regions unexpected have NA as value in all value columns (log_P, log_Q, and combined scores). Inspecting such regions they appear to all be places were the control Input exceeds the IP. viz on such:

image

Should NAs be expected in any cases ever?

Regardless, I think this is probably unrelated to my current issue of extremely broad peaks.... continuing...

The red track below is your suggested visualization in one such region (chr1:16,173,800-16,177,300).
image

Here is the whole chromosome context of that peak: image

Indeed, throughout the region the combined q > -log(.05) explains why it is being called a peak. However, I fail to understand how it could be getting such a q value in the first place. Indeed inspecting the .bam coverage shows large sub-segments of very low or even zero coverage that are giving a significant combined q value.

Can you explain such significant combined q scores in such circumstances?

I must note also that Genrich appear to be performing quite well with these parameters for the most part. Example: image

I am comfortable in principal with using -a 0 theoretically. After all, if the stats are good, significance is, erhm, significant. Anyway, I played with many variations on the parameters and dialing up -a does not make these peaks go away, as I'm sure you can expect from looking at the plot of -log10(q).

My challenge now is to understand what it is about the signal in regions such as this that yields such apparent anomalies. Any further lines of investigation are much welcomed.

note: my .bam alignments were created using STAR and multimappers are filtered out.

jsh58 commented 2 years ago

Should NAs be expected in any cases ever?

NAs are usually indicative of excluded regions. Please check your -E blacklist/blacklist.bed.

However, I fail to understand how it could be getting such a q value in the first place.

Check the -f file. The visualization you provided over a 59Mbp window looks pretty convincing to me (and maybe -g 500 is less of an issue here). If you want to evaluate specific regions based on coverage values and the calculations of p- and q-values, the -f file will provide those.

malcook commented 2 years ago

Thanks - your explanation for the NAs vis a vis our Blacklist is on target. Makes sense.

Checking the -f file in full is informative.

I've turn it into An IGV file (simply by adding a dummy 4th column), and indexing that as an .tdf file, and loading it into IGV, and the results for all of chromosome "1" displays as:

image

Above, in the top pane, the Generich p/q values (with top-to-bottom order being the order passed to Genrich - as show in my 1st msg) followed by called peaks.

In second pane are RPM-normalized coverage (produced by STAR as bedgraph and subsequently reformatted to bigwig for display in IGV).

In the 3rd pane are the .bam alignments, but we're too zoomed out for IGV to show them.

We can see that one of my replicates (the one ending in ".1") has been assigned a very high -log_p value throughout the region harboring the extremely broad peak_291 (maxing out at 9.8). These high -log_p values correspond to p values small enough that when combined by the fisher test with the other (generally insignificant ) p values from the other two replicates a sufficiently small combined p, and ultimately adjusted q value, satisfies the .05 q value specified.

Visual inspection of the .bw reveals my three replicates are pretty much in agreement throughout chromosome "1". Specifically, I can't see how replicate ".1" merits such high log_p values.

Zooming in within peak_291 sufficiently for the .bam alignment to plot, there remains no explanatory difference between my replicates that I can see: image

Do you appreciate my dilemma?

Do you have any other diagnostics or lines of inquiry for me to develop.

jsh58 commented 2 years ago

With multiple replicates, you should examine the coverage values and the calculation of p-values via the -k <file>.

malcook commented 2 years ago

Are you proposing that the coverage values in the -k file might support the -logp values for that one sample and disagree qualitatively with the values displayed in STAR's coverage values (displayed in the middle bigwig tracks)? I will check this but will require re-running Genrich with -k, and I just want to confirm your reasoning in the mean time.

malcook commented 2 years ago

Indeed the pileups point to the underlying issue, being that my replicates were a mix of paired and single end reads. The extremely broad peaks stem from paired end reads whose best paired alignment span its ends. Whether this is due to an error in the genome assembly or some statistical anomaly of the PE alignment algorithm I do not know, but realigning as single end solves the issue. Of course it comes at the cost of losing Genrich's ability to perform PCR duplicate removal, and losing STARs ability to sometimes make better placements using the PE info. But I am closing this issue as I now understand the origin of these peaks, and visual inspection shows the loss is small. I suppose if Genrich had a new option, say -Y, to use PE info only for purpose of dup removal, but not for pile-up calculation, I would get the best of both worlds. Alternatively, I could align as PE, remove dups with Picard, and then edit the .bam to remove the FLAGs encoding PE. Alternatively, I could try to understand if there is a genome mis-assembly and fix that, but that is well beyond my charter now ;). Thanks for all your help.

jsh58 commented 2 years ago

OK, glad that you were able to figure it out!

So, STAR is marking these alignments as properly paired, even though the resulting fragment would have spanned ~30Mbp? If that is the case, you could try to get STAR not to report such alignments that way, or edit the BAM file yourself by changing the bitwise FLAG to indicate that such alignments should not be considered properly paired. But in either case, Genrich would also not consider the alignments properly paired for PCR duplicate removal.

Another alternative would be to run Genrich in a preliminary step with -R <file> to identify PCR duplicates, and then remove them from the analysis before a second Genrich run to detect enriched genomic regions. Similar to your suggestion of running Picard, but Genrich can be used just to identify PCR duplicates too (and I do use it that way, frequently, even outside genome enrichment assays -- it's much faster than Picard and uses less memory, and can analyze reads with multiple alignments).

malcook commented 2 years ago

Thanks for your follow-up. I find that yes, if I use STAR's option to cap insert sizes, the issue is resolved as well. I think this is the preferred resolution, but I appreciate your presenting alternatives... it educates me as to other strengths and possible uses of Genrich in the future.