jvanheld / IBIS_2024

Participation in the IBIS benchmarking for motif discovery approaches
GNU General Public License v3.0

optimize matrices according to their capability to discriminate a positive from a negative dataset #19

Closed jvanheld closed 2 months ago

jvanheld commented 3 months ago

Rather than selecting matrices based on their matrix-quality curves, I thought that it would be more consistent to select those that best meet the evaluation criterion of the IBIS challenge.

This benchmark assesses the performance in solving the binary classification problem of discriminating ChIP-Seq or GHT-SELEX peaks (positives) from selected negative sequences.
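To make the classification task concrete, here is a minimal sketch of how a matrix could be scored on it. It assumes log-odds weights, single-strand scanning, uppercase A/C/G/T only, and the area under the ROC curve as the summary statistic. The function names `best_pwm_score` and `auroc` are illustrative, and the exact metric used by the challenge is not stated in this thread.

```python
import math

def best_pwm_score(seq, pwm):
    """Return the highest matrix score over all windows of seq (one strand only).

    pwm is a list of dicts, one per column, mapping A/C/G/T to a log-odds weight.
    Sequences shorter than the matrix return -inf.
    """
    width = len(pwm)
    best = -math.inf
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(col[base] for col, base in zip(pwm, window))
        best = max(best, score)
    return best

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count for one half. Quadratic in the number of sequences,
    which is fine for a sketch but slow for large datasets.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

A matrix would then be ranked by `auroc([best_pwm_score(s, pwm) for s in positives], [best_pwm_score(s, pwm) for s in negatives])`.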

Positives: 301-bp long regions centered on the peak summits of the technically reproducible ChIP-Seq peaks, which are the macs2 peak calls supported by (i.e. overlapping with) peak calls of any other peak callers (sissrs, cpics, gem).

Negatives:

  • 'shades': regions located in the vicinity of the ChIP-Seq peaks. To generate shades, full-length peaks shorter than 300bp are extended in both directions to cover 300bp regions. For each resulting region, we create one 300bp shade region located at a random distance of 300-600bp from the region borders; the exact location and the upstream/downstream placement are chosen randomly.
  • 'aliens': peaks of non-related proteins not overlapping any reproducible peaks of the target transcription factor.
  • 'random': random genomic regions with matched %GC composition.
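The 'shades' procedure described above can be sketched as follows. This is only an illustration of the quoted description, not the organisers' code: it assumes half-open `[start, end)` coordinates, symmetric extension of short peaks, a uniformly drawn gap, and no clipping at chromosome ends. The names `extend_to_min_length` and `make_shade` are made up for the example.

```python
import random

SHADE_LEN = 300  # shade length in bp, as in the description above

def extend_to_min_length(start, end, min_len=SHADE_LEN):
    """Extend a [start, end) interval on both sides so it covers at least min_len bp."""
    missing = min_len - (end - start)
    if missing <= 0:
        return start, end  # peaks already long enough are left unchanged
    left = missing // 2
    return start - left, end + (missing - left)

def make_shade(start, end, rng=random):
    """Return one 300-bp shade placed 300-600 bp upstream or downstream of the region."""
    start, end = extend_to_min_length(start, end)
    gap = rng.randint(300, 600)          # distance from the region border
    if rng.random() < 0.5:               # upstream placement
        shade_end = start - gap
        return shade_end - SHADE_LEN, shade_end
    shade_start = end + gap              # downstream placement
    return shade_start, shade_start + SHADE_LEN
```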

I also thought that we could apply an optimization method (in particular a genetic algorithm) to optimize the discovered matrices with respect to this criterion.
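As a rough illustration of the idea, a genetic algorithm over frequency matrices could look like the sketch below. It is a generic textbook scheme (column-wise crossover, single-cell mutation, elitist selection) written for this comment and should not be read as the implementation of the actual tool. The matrix is assumed to be a list of columns of four probabilities, and `fitness` is any user-supplied function to maximise, for example a discrimination score between positive and negative sequences.

```python
import random

def crossover(a, b, rng):
    """Build a child by taking each column from one of the two parents."""
    return [list(rng.choice((col_a, col_b))) for col_a, col_b in zip(a, b)]

def mutate(matrix, rng, step=0.1):
    """Copy the matrix, nudge one random cell, and renormalise that column to sum to 1."""
    child = [col[:] for col in matrix]
    col = rng.choice(child)
    i = rng.randrange(4)
    col[i] = max(1e-3, col[i] + rng.uniform(-step, step))
    total = sum(col)
    col[:] = [v / total for v in col]
    return child

def optimize(seed_matrix, fitness, generations=20, pop_size=20, keep=5, seed=0):
    """Evolve matrices starting from seed_matrix and return the fittest one found."""
    rng = random.Random(seed)
    population = [seed_matrix] + [mutate(seed_matrix, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:keep]  # elitism: the best matrices survive unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - keep)
        ]
        population = parents + children
    return max(population, key=fitness)
```

Because the best matrices are carried over unchanged, the returned matrix never scores worse than the seed on the data used for fitness. That says nothing about held-out data, so the optimized matrix should be evaluated on sequences that were not used during optimization.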

For this, I asked my son to develop with me a new tool named optimize-matrix-GA. The code is here: https://github.com/pvhelden/optimize-matrix-GA/ (I will incorporate it in RSAT after the challenge).

I initially ran it with the training sequences versus random genomic fragments (but I did not filter the latter according to their %GC composition).
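For reference, the missing %GC filter could be added along these lines. This is a simple greedy sketch, assuming sequences are plain strings and that a tolerance of 0.05 on the GC fraction is acceptable. Both the tolerance and the function names are arbitrary choices for the example.

```python
def gc_fraction(seq):
    """Fraction of G and C residues in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def gc_matched_negatives(positives, candidates, tolerance=0.05):
    """For each positive, take the first unused candidate with a similar GC fraction.

    Positives without any suitable candidate are silently skipped, so the
    result may be shorter than the list of positives.
    """
    unused = list(candidates)
    matched = []
    for pos in positives:
        target = gc_fraction(pos)
        for i, cand in enumerate(unused):
            if abs(gc_fraction(cand) - target) <= tolerance:
                matched.append(unused.pop(i))
                break
    return matched
```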

I am now trying to run a more specific approach by optimizing the discrimination between the sequences bound by the TF of interest in a given type of experiment (e.g. CHS, GHTS, ...) and the sequences bound by all the other TFs in the same experiment.
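Building the two sets for this one-versus-rest setup could be sketched as follows, assuming the sequences of one experiment type are available as a dictionary from TF name to list of sequences. The dictionary layout and the function name `one_vs_rest` are assumptions made for the example.

```python
def one_vs_rest(seqs_by_tf, target_tf):
    """Split the sequences of one experiment type into positives and negatives.

    Positives are the sequences bound by target_tf. Negatives are the
    sequences bound by any other TF, excluding duplicates and sequences
    that are also bound by target_tf.
    """
    positives = list(seqs_by_tf[target_tf])
    seen = set(positives)
    negatives = []
    for tf, seqs in seqs_by_tf.items():
        if tf == target_tf:
            continue
        for seq in seqs:
            if seq not in seen:
                seen.add(seq)
                negatives.append(seq)
    return positives, negatives
```

Note that this gives far more negatives than positives when many TFs share an experiment type, so the negative set may need subsampling before computing a discrimination score.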

jvanheld commented 2 months ago

Done