eu9ene opened 3 months ago
See also #229.
Maybe also:
- `comet-compare`
- for chrF, `sacrebleu --paired-ar`
EDIT: this will help a lot when a couple of models have very close metric values. It also matters for short test sets like Flores (only ~1000 sentences), where a 0.5 chrF difference is sometimes not statistically significant.
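For reference, a minimal sketch of how the `comet-compare` paired significance test could be invoked, assuming one-sentence-per-line files with hypothetical names (flags follow the Unbabel COMET CLI docs; adjust for the installed version):

```sh
# Paired significance test between two systems with COMET.
# File names are placeholders: source, two hypotheses, and a reference.
comet-compare -s wmt20.cs-en.src.cs \
              -t wmt20.bergamot.en wmt20.google.en \
              -r wmt20.cs-en.ref.en
```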
We already run `comet-compare` in the evals repo, but it doesn't look very useful so far because our models are usually quite a bit behind Google and Microsoft, so the output almost always looks like this:
```
wmt20.cs-en
wmt20.microsoft.en outperforms wmt20.bergamot.en.
wmt20.google.en outperforms wmt20.bergamot.en.
```
We don't run significance tests with SacreBLEU, though, so that's something we should add; see the sketch below.
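A sketch of what that could look like using SacreBLEU's built-in paired tests (flags per the sacrebleu 2.x significance-testing docs; file names are placeholders, and the first system passed to `-i` is treated as the baseline):

```sh
# Paired approximate randomization over BLEU and chrF.
# sacrebleu also supports paired bootstrap resampling via --paired-bs.
sacrebleu wmt20.cs-en.ref.en \
  -i wmt20.bergamot.en wmt20.google.en \
  -m bleu chrf \
  --paired-ar
```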
This is a meta issue to brainstorm ideas on how to make our final automatic quality evaluation procedure more robust and suitable for decision-making. This currently happens in the firefox-translations-models repo but will be migrated to the main repo and W&B.
See this doc.
Some ideas:
- Recommendations from the paper *Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies*
We can add a task list here after migrating firefox-translations-models.