jvanheld / IBIS_2024

Participation in the IBIS benchmarking of motif discovery approaches
GNU General Public License v3.0

Peak center-depleted k-mers #6

Open jvanheld opened 2 months ago

jvanheld commented 2 months ago

RSAT position-analysis detects k-mers that show any type of positional heterogeneity. For each k-mer, it runs a chi-squared homogeneity test that compares the profile of counts per positional window with the profile that would be expected if the occurrences of this k-mer were spread homogeneously along the sequences.
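To make the test concrete, here is a minimal sketch of a chi-squared homogeneity statistic computed on a toy positional profile. It is not the RSAT implementation: the function name `chi2_homogeneity`, the window weights and the counts are all made up for illustration, and the sketch stops at the statistic and the degrees of freedom (no p-value or multiple-testing correction).

```python
def chi2_homogeneity(observed, window_weights):
    """Chi-squared statistic of observed window counts vs. a homogeneous model.

    observed: occurrences of one k-mer in each positional window.
    window_weights: relative number of possible k-mer positions per window
        (any positive scale). Expected counts are these weights rescaled
        so that they sum to the observed total.
    Returns (statistic, degrees_of_freedom, expected_counts).
    """
    if len(observed) != len(window_weights):
        raise ValueError("observed and window_weights must have the same length")
    if any(w <= 0 for w in window_weights):
        raise ValueError("window weights must be strictly positive")

    total_obs = sum(observed)
    total_weight = sum(window_weights)
    expected = [total_obs * w / total_weight for w in window_weights]

    # Sum over windows of (observed - expected)^2 / expected
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    degrees_of_freedom = len(observed) - 1
    return statistic, degrees_of_freedom, expected


# Toy data (hypothetical): 5 windows, more positions available near the center.
weights = [1, 2, 3, 2, 1]
homogeneous_counts = [10, 20, 30, 20, 10]  # follows the weights exactly
depleted_counts = [18, 25, 4, 25, 18]      # same total, hole in the center

stat_flat, df_flat, _ = chi2_homogeneity(homogeneous_counts, weights)
stat_depl, df_depl, _ = chi2_homogeneity(depleted_counts, weights)
print(stat_flat, stat_depl, df_depl)
```

The point of the toy example is that the statistic only measures the distance from the homogeneous profile: a centrally depleted k-mer scores as high as a centrally enriched one with the same deviations.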

With peaks, this approach is very effective at detecting k-mers (e.g. 6nt and 7nt) that are enriched around peak centres, and which generally correspond to fragments of the binding motifs. A PSSM (PWM) is then built from k-mer assemblies.

However, the approach is not intrinsically designed to detect centrally-enriched k-mers specifically. More generally, it detects any type of heterogeneity in the positional profile of k-mer occurrences, e.g. central enrichment, central depletion, multiple peaks or valleys, or periodic waves.

With some peaksets of the IBIS challenge ChIP-seq (CHS) data, position-analysis detects k-mers (and derives PWMs from them) that are centrally depleted. This is the case for ZNF362, illustrated here with 6nt k-mers. I re-ran the position-analysis command with the option -return graphs in order to produce the individual graphs of k-mer positional distributions (this option is not activated by default in peak-motifs because the graphs would take too much space).

The most significant k-mer is AAAATA with the following profile.

image

The blue curve shows the distribution of occurrences along the peakset (position 0 is the center of each peak). The green curve shows the profile expected under the null hypothesis (homogeneous distribution along the peaks). Since the peaks have unequal lengths, the number of sequences per window decreases with the distance from the peak centers, which explains the typical hat shape of the expected distribution (green).
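The hat shape can be reproduced with a small sketch that counts, for each window, how many k-mer positions are available across peaks of unequal lengths. This is only an illustration under simplifying assumptions (peaks aligned on their centers, fixed-width windows, no masked positions); the function names, the window layout and the peak lengths are hypothetical and do not reproduce the position-analysis code.

```python
def positions_per_window(peak_lengths, k=6, window=20, n_windows=11):
    """Count the possible k-mer start positions falling in each window.

    Positions are expressed relative to the peak center; window i covers
    offsets [-half_span + i * window, -half_span + (i + 1) * window).
    """
    half_span = window * n_windows // 2
    counts = [0] * n_windows
    for length in peak_lengths:
        n_positions = length - k + 1  # number of k-mer starts in this peak
        if n_positions <= 0:
            continue  # peak shorter than k: contributes nothing
        first_offset = -(n_positions // 2)
        for start in range(n_positions):
            index = (first_offset + start + half_span) // window
            if 0 <= index < n_windows:
                counts[index] += 1
    return counts


def expected_profile(total_occurrences, window_positions):
    """Spread the occurrences of a k-mer proportionally to available positions."""
    total_positions = sum(window_positions)
    return [total_occurrences * n / total_positions for n in window_positions]


# Three hypothetical peaks of unequal lengths: every peak covers the central
# window, but only the longest one reaches the outer windows.
available = positions_per_window([45, 105, 205])
expected = expected_profile(340, available)
print(available)
print(expected)
```

Every peak contributes positions to the central window whereas only the longest peaks contribute to the distal ones, so the expected counts are highest at position 0 and decrease on both sides.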

The same type of profile is found for the other highly significant 6nt and 7nt k-mers detected by position-analysis in this peakset, which are all AT-rich.

image

The consequence is that in such cases one should absolutely not submit these matrices to the challenge because they correspond to motifs that are actually under-represented in the peak center.
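One possible way to flag such cases automatically would be to check the direction of the deviation in the central window(s) before accepting a matrix. The sketch below is a hypothetical heuristic, not an existing RSAT option: the function names, the pseudo-count and the log2 threshold are arbitrary choices made for illustration.

```python
import math


def central_log_ratio(observed, expected, n_central=1, pseudo=1.0):
    """log2(observed / expected) summed over the n_central middle windows.

    n_central is expected to be odd so that the selection is symmetric
    around the central window. The pseudo-count avoids log(0).
    """
    mid = len(observed) // 2
    half = n_central // 2
    lo, hi = mid - half, mid + half + 1
    obs_central = sum(observed[lo:hi])
    exp_central = sum(expected[lo:hi])
    return math.log2((obs_central + pseudo) / (exp_central + pseudo))


def classify_profile(observed, expected, threshold=0.5):
    """Label a positional profile by the sign of its central deviation."""
    ratio = central_log_ratio(observed, expected)
    if ratio >= threshold:
        return "enriched"
    if ratio <= -threshold:
        return "depleted"
    return "neutral"


# Toy profiles (hypothetical counts) against the same expected profile.
expected = [10, 20, 30, 20, 10]
print(classify_profile([5, 15, 50, 15, 5], expected))    # excess at the center
print(classify_profile([18, 25, 4, 25, 18], expected))   # hole at the center
print(classify_profile([10, 20, 30, 20, 10], expected))  # matches expectation
```

Only k-mers labelled as enriched would then be kept for matrix building; the threshold would have to be tuned on real peaksets.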

jvanheld commented 2 months ago

@brunocontrerasmoreira and @najlaksouri, this is an interesting observation to take into consideration when interpreting peak-motifs results on ChIP-seq data.

brunocontrerasmoreira commented 2 months ago

This is probably related to the observation that the central regions of these peaks were in most cases purged and hard-masked with Ns by default.