Open jvanheld opened 2 months ago
@brunocontrerasmoreira and @najlaksouri, this is an interesting observation to take into account when interpreting peak-motifs results on ChIP-seq data.
This is probably related to the observation that the central regions of these peaks were in most cases purged and hard-masked with Ns by default.
RSAT `position-analysis` detects k-mers that present any type of positional heterogeneity by running, for each k-mer, a chi-squared homogeneity test on its profile of counts per positional window, compared to the profile that would be expected if the occurrences of this k-mer were spread homogeneously along the sequences. With peaks, this approach is very efficient to detect k-mers (e.g. 6nt and 7nt) that are enriched around peak centres, and which generally correspond to fragments of the binding motifs. A PSSM (PWM) is then built from k-mer assemblies.
However, the approach is not intrinsically built to detect centrally-enriched k-mers specifically. More generally, it detects any type of positional heterogeneity along a positional profile of k-mer occurrences, e.g. central enrichment, central depletion, multiple peaks or valleys, periodic waves, etc.
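To illustrate why the test cannot tell enrichment from depletion, here is a minimal sketch of a chi-squared statistic computed on window counts. It is not the RSAT implementation: the function name `chi2_homogeneity`, the five-window layout and the toy counts are assumptions made for the example.

```python
def chi2_homogeneity(observed, expected_freq):
    """Chi-squared statistic of observed window counts against an expected profile.

    observed      : occurrences of one k-mer in each positional window
    expected_freq : expected fraction of occurrences per window (sums to 1)
    Returns (statistic, degrees of freedom).
    """
    total = sum(observed)
    stat = 0.0
    for obs, freq in zip(observed, expected_freq):
        exp = total * freq
        stat += (obs - exp) ** 2 / exp
    return stat, len(observed) - 1


# Toy profiles over 5 windows, with a flat expectation (hypothetical numbers).
flat = [0.2] * 5
enriched = [10, 20, 40, 20, 10]  # excess of occurrences in the central window
depleted = [30, 20, 0, 20, 30]   # deficit of occurrences in the central window

stat_enriched, df = chi2_homogeneity(enriched, flat)
stat_depleted, _ = chi2_homogeneity(depleted, flat)
print(stat_enriched, stat_depleted, df)
```

Because the deviations are squared, both toy profiles get the same statistic: the test reports how far the profile is from homogeneity, not in which direction.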
With some peaksets of the IBIS challenge ChIP-seq (CHS) data, `position-analysis` detects k-mers (and derives PWMs from them) that are centrally depleted. This is the case for ZNF362. Here is an example with 6nt. I re-ran the command `position-analysis` with the option `-return graphs` in order to produce the individual graphs of k-mer positional distributions (this option is not activated by default in `peak-motifs` because it would take too much space).

The most significant k-mer is AAAATA with the following profile.
The blue curve shows the distribution of occurrences along the peakset (position 0 is the center of each peak). The green curve shows the profile expected under the null hypothesis (homogeneous distribution along the peaks). Since the peaks have unequal lengths, the number of sequences per window decreases with the distance from the peak centres, which explains the typical hat shape of the expected distribution (green).
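The hat shape can be reproduced with a small sketch that counts, for each window, how many candidate k-mer positions the peaks contribute. This is only an illustration under stated assumptions: the function `expected_profile`, the 20 bp window width and the three peak lengths are hypothetical, and the actual computation in RSAT may differ.

```python
from collections import Counter


def expected_profile(peak_lengths, window=20, k=6):
    """Fraction of candidate k-mer positions per window, with position 0 at each peak centre.

    Under a homogeneous distribution, the expected occurrences of a k-mer in a
    window are proportional to the number of positions available in that window.
    """
    positions = Counter()
    for length in peak_lengths:
        for start in range(length - k + 1):
            # Midpoint of the k-mer, expressed relative to the peak centre.
            rel = start + k / 2 - length / 2
            positions[int(rel // window)] += 1
    total = sum(positions.values())
    # Keys are the left border of each window, relative to the peak centre.
    return {w * window: n / total for w, n in sorted(positions.items())}


# Three peaks of unequal lengths (hypothetical values).
profile = expected_profile([100, 200, 400])
for left_border, fraction in profile.items():
    print(left_border, round(fraction, 4))
```

All three peaks contribute positions to the central windows, but only the longest one reaches the outermost windows, so the expected fraction drops away from the centre.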
The same type of profile is found for the other highly significant 6nt and 7nt k-mers detected by `position-analysis` in this peakset, which are all AT-rich.

The consequence is that in such cases one should absolutely not submit these matrices to the challenge, because they correspond to motifs that are actually under-represented in the peak centres.
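One possible safeguard, sketched here as a suggestion rather than an existing RSAT feature, is to check the direction of the deviation in the central windows before keeping a k-mer or a matrix. The function `central_log_ratio`, the pseudo-count and the toy counts are assumptions made for the example.

```python
import math


def central_log_ratio(observed, expected_freq, central, pseudo=0.5):
    """log2(observed / expected) occurrences summed over the central windows.

    observed      : occurrences of one k-mer in each positional window
    expected_freq : expected fraction of occurrences per window (sums to 1)
    central       : indices of the windows considered as the peak centre
    pseudo        : pseudo-count avoiding log(0) for fully depleted k-mers
    Positive values indicate central enrichment, negative values central depletion.
    """
    total = sum(observed)
    obs_central = sum(observed[i] for i in central)
    exp_central = total * sum(expected_freq[i] for i in central)
    return math.log2((obs_central + pseudo) / (exp_central + pseudo))


# Toy profiles over 5 windows with a flat expectation (hypothetical numbers).
flat = [0.2] * 5
profiles = {
    "enriched": [10, 20, 40, 20, 10],
    "depleted": [30, 20, 0, 20, 30],
}
for name, counts in profiles.items():
    ratio = central_log_ratio(counts, flat, central=[2])
    verdict = "keep" if ratio > 0 else "discard (centrally depleted)"
    print(name, round(ratio, 2), verdict)
```

A k-mer with a negative ratio is significant for the homogeneity test but under-represented at the peak centres, which is the situation described above for the AT-rich k-mers of ZNF362.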