Closed jvanheld closed 3 months ago
The idea would be to assess the enrichment by inspecting the curves returned by `matrix-quality`.
We should add this step to the Makefile; I will be able to help from Monday!
I added the `matrix-quality` target and ran it for the 3 data types treated so far: CHS, GHTS, HTS.
The results are interesting, we should discuss them.
Some discovered motifs are actually not enriched but impoverished compared to the test sets. For some other motifs, the permutation test clearly shows that the motif has a low complexity: the score distribution curve for permuted motifs follows exactly the theoretical curve.
The `matrix-quality` plots are informative but do not allow us to automatically select the best motifs for each dataset.
I thought about a different approach: selecting matrices based on their performance in discriminating train sequences from other sequences. I also thought about a genetic algorithm that would optimize the matrices according to their AuROC in a train-versus-others comparison. This directly fits the evaluation criteria of IBIS.
I opened another issue (https://github.com/jvanheld/IBIS_2024/issues/19) about this matrix optimization approach.
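To make the train-versus-others criterion concrete, here is a minimal Python sketch of how one matrix could be ranked by AuROC. It is only an illustration under stated assumptions, not the code used in this repository: the function names (`log_odds_matrix`, `best_site_score`, `auroc`, `matrix_auroc`) are hypothetical, each sequence is summarized by its best single-strand site score, the background is equiprobable, and sequences are assumed to contain only uppercase A, C, G, T.

```python
import math

BASES = "ACGT"

def log_odds_matrix(counts, pseudo=1.0, bg=0.25):
    """Convert a count matrix (one dict per column) into log-odds weights.

    Assumption: equiprobable background and a pseudo-count spread
    evenly over the four residues.
    """
    weights = []
    for col in counts:
        total = sum(col[b] for b in BASES) + pseudo
        weights.append(
            {b: math.log(((col[b] + pseudo * bg) / total) / bg) for b in BASES}
        )
    return weights

def best_site_score(weights, seq):
    """Highest weight score over all windows of the sequence (one strand only)."""
    width = len(weights)
    scores = [
        sum(weights[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    ]
    # Sequences shorter than the matrix get the lowest possible score.
    return max(scores) if scores else float("-inf")

def auroc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative (ties count for 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def matrix_auroc(counts, train_seqs, other_seqs):
    """AuROC of one matrix for separating train sequences from the others."""
    weights = log_odds_matrix(counts)
    pos = [best_site_score(weights, s) for s in train_seqs]
    neg = [best_site_score(weights, s) for s in other_seqs]
    return auroc(pos, neg)

# Toy data (made up for illustration): a perfect ACGT motif.
toy_counts = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
]
toy_train = ["TTACGTTT", "GGACGTCC"]
toy_others = ["TTTTTTTT", "GGGGCCCC"]
toy_auc = matrix_auroc(toy_counts, toy_train, toy_others)
```

Sorting the discovered matrices by this value would give an automatic ranking per dataset, and the same function could serve as the fitness of a genetic algorithm. The pairwise loop is quadratic in the number of sequences, so a rank-based computation would be preferable on real datasets.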
With GHTS, peak-motifs returns many low-complexity motifs, as well as some relevant motifs. The same holds true for the other data types, although CHS provides more relevant results.
In any case, we could use an additional criterion: enrichment in high-scoring sites for the discovered motifs. This can be done with `matrix-quality`.
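As a rough illustration of what such a criterion measures, the Python sketch below compares the fraction of sequences that have a site above a score threshold in a foreground set and in a background set. This is a simplified stand-in and not what `matrix-quality` computes internally; the function names and the choice of a hypergeometric test are assumptions made for the example, and the per-sequence scores are assumed to have been computed beforehand.

```python
from math import comb

def hypergeom_tail(k, K, n, N):
    """P(X >= k) when drawing n items out of N, of which K are hits."""
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

def high_score_enrichment(fg_scores, bg_scores, threshold):
    """Compare the fraction of sequences scoring >= threshold in two sets.

    Returns (foreground fraction, background fraction, ratio, p-value).
    A ratio below 1 indicates depletion rather than enrichment.
    """
    fg_hits = sum(s >= threshold for s in fg_scores)
    bg_hits = sum(s >= threshold for s in bg_scores)
    n_fg, n_bg = len(fg_scores), len(bg_scores)
    fg_frac = fg_hits / n_fg
    bg_frac = bg_hits / n_bg
    ratio = fg_frac / bg_frac if bg_frac > 0 else float("inf")
    # One-sided test: are high-scoring sequences over-represented in the foreground?
    pval = hypergeom_tail(fg_hits, fg_hits + bg_hits, n_fg, n_fg + n_bg)
    return fg_frac, bg_frac, ratio, pval

# Made-up best-site scores, for illustration only.
fg_frac, bg_frac, ratio, pval = high_score_enrichment(
    [5, 6, 7, 1], [1, 2, 0, 6], threshold=5
)
```

The threshold is a free parameter here; in practice it would have to be chosen per matrix, for example from a target p-value on the theoretical score distribution.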