jvanheld / IBIS_2024

Participation in the IBIS benchmarking of motif discovery approaches
GNU General Public License v3.0

IBIS Benchmarking protocol - some thoughts #10

Open jvanheld opened 1 month ago

jvanheld commented 1 month ago

Some thoughts about the evaluation of the PFMs submitted by applicants.

See the benchmarking protocols.

General notes on position matrices

Only position frequency matrices (PFMs) are accepted. Before benchmarking, a pseudocount of 0.00001 is added to each value of a PFM. Internally, PFMs are additionally converted to log-odds position weight matrices (PWMs):

$$\text{PWM} = \log_2\left(\frac{\text{PFM} + 10^{-5}}{0.25}\right)$$

In each benchmarking protocol, PWM scanning is performed with PWMEval (Ambrosini et al. 2020a) in two modes: PWM best-hit and PFM sum-occupancy. PWMEval accepts only integer PWM values, so each PWM value is truncated to 5 digits after the decimal point and multiplied by $10^5$. The sum-occupancy (Orenstein and Shamir 2014) estimation is computed as in (Ambrosini et al. 2020b). The PWM width (i.e. the minimal length of the scorable nucleotide 'substring') must be from 5 to 30, see also the file format details.
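For reference, here is a minimal numpy sketch of this conversion as I understand it from the protocol text (the A, C, G, T column order and the truncation towards zero are my assumptions, not taken from the PWMEval code):

```python
import numpy as np

def pfm_to_pwm(pfm, pseudo=1e-5, bg=0.25):
    """Convert a PFM (rows = positions, columns = A, C, G, T) into a
    log-odds PWM as described in the protocol: log2((pfm + 1e-5) / 0.25)."""
    return np.log2((np.asarray(pfm, dtype=float) + pseudo) / bg)

def pwm_to_int(pwm):
    """Truncate each PWM value to 5 decimals and multiply by 10^5,
    yielding the integer matrix that PWMEval expects."""
    return np.trunc(pwm * 1e5).astype(int)

# Toy 5-position PFM (relative frequencies summing to 1 per position)
pfm = np.array([
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.97],
])

print(pwm_to_int(pfm_to_pwm(pfm)))
```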

jvanheld commented 1 month ago

Benchmarking on genomic regions from ChIP-Seq and genomic HT-SELEX

Handling replicates: when available, the data from experimental replicates are provided independently for training. As test data, we use the single replicate with the largest number of peak calls. For GHT-SELEX, we use the results from a single cycle of a single replicate that yielded the largest number of peaks.

jvanheld commented 1 month ago

MAUPRC and MAUROC

It is largely recognised that the AUROC is a very poor metric when the positive and negative sets are highly unbalanced. This is clearly the case for the evaluation of binding sites, where the number of non-binding positions in a genome is far higher than the number of binding positions. This imbalance is partly reduced by assigning a single score to each sequence (positive or negative) rather than to each position of each sequence. With PWMEval, the evaluation relies either on the top score (PWM best-hit) or on the cumulative score (PFM sum-occupancy). However, there is still a strong imbalance between the sizes of the positive and negative sets (peaks, PBM oligonucleotides, ...). I therefore still think that the AUROC should be avoided.

AUPRC is probably a more suitable metric; I don't see why both have to be computed.

At the end of the challenge, it would be interesting to run some analyses comparing these two scores across all submissions from all applicants. In the meantime, we need to find the best way to achieve good scores on our side.
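To make the imbalance argument concrete, here is a toy simulation (my own, unrelated to the actual IBIS data) where the score distributions of positives and negatives stay fixed and only the class ratio changes: the AUROC barely moves, while the AUPRC collapses.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)

def simulate(n_pos, n_neg):
    """Draw positive scores slightly above negative scores, mimicking
    PWM best-hit scores on bound vs unbound sequences."""
    scores = np.concatenate([rng.normal(1.0, 1.0, n_pos),
                             rng.normal(0.0, 1.0, n_neg)])
    labels = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    return labels, scores

for n_neg in (1_000, 10_000, 100_000):
    labels, scores = simulate(1_000, n_neg)
    print(f"1000 pos vs {n_neg:>6} neg: "
          f"AUROC = {roc_auc_score(labels, scores):.3f}, "
          f"AUPRC = {average_precision_score(labels, scores):.3f}")
```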

@brunocontrerasmoreira

For matrix-quality, we should activate specific options to retain only the top-scoring hit in each sequence of the input sets. This can be done by adding the following argument to matrix-quality, which will pass it on to matrix-scan: -uth rank 1. As explained in the matrix-scan documentation (matrix-scan -h), setting the upper threshold on the rank to 1 returns only the top-ranking site for each sequence. Unfortunately, this option does not work with matrix-scan-quick, and it likely does not work with the option -return distrib either.

An alternative: directly scan the sequences with matrix-scan (slow Perl implementation, but more flexible) with the option -uth rank 1. We could scan the training sequences and a negative set built according to the IBIS protocol separately, and then draw ROC and PRC curves.
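A minimal sketch of that last step, assuming the best-hit scores have already been extracted from the matrix-scan output into plain-text files with one score per line (the file names below are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve

def read_scores(path):
    """One best-hit score per sequence, one per line (hypothetical
    format, e.g. extracted from matrix-scan output with -uth rank 1)."""
    return np.loadtxt(path)

# Hypothetical file names; adapt to the actual matrix-scan outputs
pos = read_scores("train_best_hits.txt")
neg = read_scores("negative_best_hits.txt")

labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
scores = np.concatenate([pos, neg])

fpr, tpr, _ = roc_curve(labels, scores)
prec, rec, _ = precision_recall_curve(labels, scores)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.plot(fpr, tpr)
ax1.set(xlabel="FPR", ylabel="TPR", title="ROC")
ax2.plot(rec, prec)
ax2.set(xlabel="Recall", ylabel="Precision", title="PR")
fig.tight_layout()
fig.savefig("roc_prc.png")
```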

jvanheld commented 1 month ago

Background model

The formula in the benchmarking protocol assumes equiprobable and independent nucleotides. We know this is far from true for human regulatory sequences, which are better modelled with first-order Markov models in order to account for dependencies between successive nucleotides, such as AA and TT aggregation and CpG depletion (except in CpG islands).

@brunocontrerasmoreira : this means that, for our own analyses with matrix-quality, we had better use a model of identically and independently distributed (IID) nucleotides, to stay consistent with the benchmarking protocol.

In contrast, for motif discovery I really think we need to keep higher-order Markov models as defined by peak-motifs.
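For clarity, here is a small sketch of what estimating such a first-order background model amounts to (this is just the underlying idea, not the peak-motifs or RSAT implementation):

```python
from collections import defaultdict

def markov1_background(sequences, pseudo=1.0):
    """Estimate a first-order Markov background model, i.e. P(b | a)
    for all dinucleotides ab, from a set of training sequences.
    A pseudocount avoids zero probabilities for unseen transitions."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        seq = seq.upper()
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    model = {}
    for a in "ACGT":
        total = sum(counts[a][b] for b in "ACGT") + 4 * pseudo
        model[a] = {b: (counts[a][b] + pseudo) / total for b in "ACGT"}
    return model

# Toy example: AA/TT-rich sequences yield asymmetric transition probabilities
model = markov1_background(["AAATTTAAGCAATT", "TTTAAACGAATTAA"])
for a in "ACGT":
    print(a, {b: round(p, 2) for b, p in model[a].items()})
```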

Impact of the background model

Not surprisingly, the choice of an equiprobable model strongly changes the distribution of scores produced by matrix-quality.

For example, for the CHS dataset THC_0866 (transcription factor GABPA), we previously used a Markov model of order 1 with parameters estimated from the CHS leaderboard test dataset.

Here is the matrix-quality result for the top-scoring matrix.

[two matrix-quality plots for the top-scoring matrix, first-order Markov background]

When the same motif is analysed with equiprobable nucleotides, the curves are quite different.

[matrix-quality plot for the same matrix, equiprobable background]

Conclusion

These sequences have 54% A+T and 46% C+G but, more importantly, there are dependencies between successive nucleotides. The first-order Markov model thus seems more appropriate, but since the benchmarking protocol uses an equiprobable model, I am adapting the scripts to run with it as well for our analyses.

jvanheld commented 1 month ago

Pseudo-counts

The advantage of pseudo-counts as defined by Hertz and Stormo is that they make it possible to weight the confidence in the training data (the aligned binding-site sequences used to estimate the nucleotide frequencies at each position of the matrix). IBIS submissions require relative frequencies rather than count matrices, which means we lose an important piece of information: the total counts at each position (i.e. the number of sites used to build the matrix).

Besides, the pseudo-frequency used here ($\text{1e-5}$) is particularly low. The theoretical criteria for choosing the pseudo-weight (pseudo-count, pseudo-frequency) have not been studied much, but it should reflect the confidence one has in the training data, and thus the absolute counts rather than the relative frequencies. Yves Moreau's group (2000) proposed an interesting empirical rule: set the total pseudo-count of a column to the square root of the column's total count. This rule seems a bit pragmatic, but it is consistent with a Poisson-based modelling of the counts, since the standard deviation of the Poisson distribution equals the square root of its mean (the variance equals the mean).
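Here is a small numpy sketch of that rule; distributing the $\sqrt{N}$ pseudo-count over residues according to background frequencies is my own choice for illustration, not necessarily how the original paper does it:

```python
import numpy as np

def add_sqrt_pseudocounts(count_column, bg=(0.25, 0.25, 0.25, 0.25)):
    """Moreau-style rule: distribute a total pseudo-count equal to
    sqrt(N) over the column (here proportionally to background
    frequencies), where N is the total number of sites in the column."""
    counts = np.asarray(count_column, dtype=float)
    n = counts.sum()
    pseudo = np.sqrt(n) * np.asarray(bg)
    return (counts + pseudo) / (n + np.sqrt(n))

# Columns with the same relative frequencies are smoothed much more
# strongly when the underlying counts are low.
print(add_sqrt_pseudocounts([97, 1, 1, 1]))  # N = 100, total pseudo-count 10
print(add_sqrt_pseudocounts([7, 1, 1, 0]))   # N = 9,   total pseudo-count 3
```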