Datasets with no good motifs

jvanheld commented 4 months ago

For some datasets, oligo-analysis returns highly significant k-mers, but

- the assembled matrix is of low complexity (basically, the matrix contains a succession of Gs);
- position-analysis returns a weak significance, indicating that there is no strong positional bias of this "motif".

For this type of peak set, we have to evaluate whether it is better to send motifs (PSSMs) or to use radically different approaches. I think that for this data type, one possibility would be to apply supervised classification based on a table of k-mer counts in each peak. If we have time (which is far from certain), we could test this.
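To make the k-mer table idea concrete, here is a minimal sketch of how such a table could be built, with one row per peak and one column per k-mer. The function name `kmer_count_table` and the toy sequences are hypothetical, counting is done on a single strand, and no classifier is attached; this only illustrates the input a supervised method would receive.

```python
from collections import Counter
from itertools import product


def kmer_count_table(peaks, k=3):
    """Return (kmers, rows): one row of k-mer counts per peak sequence.

    Hypothetical helper for illustration; counts overlapping k-mers
    on the given strand only.
    """
    # Fixed column order: all 4^k words over the DNA alphabet.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    rows = []
    for seq in peaks:
        seq = seq.upper()
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        # Words containing other symbols (e.g. N) have no column and are dropped.
        rows.append([counts[w] for w in kmers])
    return kmers, rows


# Toy peaks (made up): a G-rich sequence and a more complex one.
peaks = ["GGGGGGGG", "ACGTACGT"]
kmers, table = kmer_count_table(peaks, k=2)
```

Each row of `table` could then be passed, together with a bound/unbound label, to any standard classifier.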

brunocontrerasmoreira commented 4 months ago

Hi Jacques, are you using background frequencies of masked ref genomes?

jvanheld commented 4 months ago

Hi Bruno,

I am estimating the k-mer prior probabilities based on a Markov model of order k-2 (or lower if the peak set size is too small).
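For readers unfamiliar with this kind of background model, the sketch below shows one generic way to compute the prior probability of a k-mer under a Markov model estimated from the input sequences. It is an assumption-laden illustration, not the oligo-analysis implementation: the function names are hypothetical, no pseudo-counts are applied, and only one strand is counted.

```python
from collections import Counter
from math import prod


def count_words(seqs, n):
    """Count all overlapping words of length n in the sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts


def markov_kmer_prob(word, seqs, order):
    """Prior probability of `word` under a Markov model of the given order.

    P(w) = P(prefix of length m) * product of P(next letter | previous m letters),
    with all frequencies estimated from `seqs` (no pseudo-counts).
    """
    m = order
    if m == 0:
        # Order 0: independent residues (Bernoulli model).
        counts = count_words(seqs, 1)
        total = sum(counts.values())
        return prod(counts[ch] / total for ch in word) if total else 0.0
    prefix_counts = count_words(seqs, m)
    trans_counts = count_words(seqs, m + 1)
    total = sum(prefix_counts.values())
    if total == 0:
        return 0.0
    p = prefix_counts[word[:m]] / total
    for i in range(m, len(word)):
        context = word[i - m:i]
        ctx_total = sum(trans_counts[context + b] for b in "ACGT")
        if ctx_total == 0:
            return 0.0  # context never observed in the training sequences
        p *= trans_counts[context + word[i]] / ctx_total
    return p


# For k = 3, the order is k - 2 = 1 (toy sequence, made up).
p_acg = markov_kmer_prob("ACG", ["ACGTACGT"], order=1)
```

With small peak sets, many contexts of length m are rarely or never observed, which is one reason to fall back to a lower order.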

It would be interesting to evaluate the impact of the BG model by comparing the motifs discovered with different alternatives:

- Markov model with transition probabilities estimated from the input peak sequences
- Ref genome
- Masked ref genome

A masked ref genome might be interesting, but it will be a mixture of different sequence types, most of which might differ in composition from the peaks. In my experience, using the peaks themselves gives better results for oligo-analysis, but we could test the alternatives.

However, this will only modify the results of oligo-analysis. The fact that position-analysis returns only weakly significant motifs suggests that there is a more fundamental problem with these peaks, and that PWMs may be suboptimal for classifying peaks as regulated or not by the same TF.

jvanheld / IBIS_2024
