jvanheld / IBIS_2024

Participation in the IBIS benchmarking of motif discovery approaches
GNU General Public License v3.0

matrix-quality criterion #7

Closed jvanheld closed 3 months ago

jvanheld commented 4 months ago

With GHTS data, peak-motifs returns many low-complexity motifs alongside some relevant ones. The same holds for the other data types, although CHS yields more relevant results.

In any case, we could use an additional criterion: enrichment in high-scoring sites for the discovered motifs. This can be assessed with matrix-quality.

jvanheld commented 4 months ago

The idea would be to assess the enrichment by observing the curves returned by matrix-quality.

brunocontrerasmoreira commented 4 months ago

We should add this step to the Makefile. I will be able to help from Monday!

jvanheld commented 4 months ago

I added the matrix-quality target and ran it for the 3 data types treated so far: CHS, GHTS, HTS. The results are interesting; we should discuss them.

Some discovered motifs are actually not enriched but impoverished compared to the test sets. For some other motifs, the permutation test clearly shows that the motif has low complexity: the score distribution curve for permuted motifs follows exactly the theoretical curve.

jvanheld commented 3 months ago

The matrix-quality plots are informative, but they do not allow us to automatically select the best motifs for each dataset.

I thought about a different approach: selecting matrices based on their performance in discriminating train sequences from other sequences. I also thought about a genetic algorithm that would optimize the matrices according to their AuROC in a train-versus-others comparison. This directly fits the evaluation criteria of IBIS.
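To make the selection criterion concrete, here is a minimal sketch of ranking matrices by AuROC. It assumes each matrix has already been reduced to one score per sequence (for instance the best site score) for the train set and for the other sequences. The function names `auroc` and `rank_matrices` and the input layout are hypothetical and are not part of the repository or of matrix-quality.

```python
def auroc(pos_scores, neg_scores):
    """Area under the ROC curve, computed as the probability that a
    randomly chosen positive scores higher than a randomly chosen
    negative (ties count for one half)."""
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def rank_matrices(scores_by_matrix):
    """Rank matrices by decreasing AuROC.

    scores_by_matrix (hypothetical layout) maps a matrix name to a pair
    (train_scores, other_scores) of per-sequence scores.
    """
    ranked = [
        (name, auroc(train, others))
        for name, (train, others) in scores_by_matrix.items()
    ]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


if __name__ == "__main__":
    # Toy scores, for illustration only (not real results).
    toy = {
        "matrix_a": ([3.0, 4.0], [1.0, 2.0]),
        "matrix_b": ([1.0, 2.0], [1.0, 2.0]),
    }
    for name, auc in rank_matrices(toy):
        print(name, auc)
```

The pairwise comparison is quadratic in the number of sequences, which is fine for a sketch; a rank-based computation would scale better on full datasets. The same `auroc` value could serve as the fitness function of the genetic algorithm mentioned above.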

I opened another issue (https://github.com/jvanheld/IBIS_2024/issues/19) about this matrix optimization approach.