mozilla/translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Improve automatic quality evaluation #764

Open eu9ene opened 3 months ago

eu9ene commented 3 months ago

This is a meta issue to brainstorm ideas on how to make our final automatic quality evaluation procedure more robust and suitable for decision-making. This currently happens in the firefox-translations-models repo but will be migrated to the main repo and W&B.

See this doc.

Some ideas:

Recommendations from the paper "Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies":

We can add a task list here after migrating firefox-translations-models.

marco-c commented 3 months ago

See also #229.

ZJaume commented 1 week ago

Maybe also:

EDIT: this will help a lot when two models have very close metric values. It would also matter for short test sets like Flores (only about 1,000 sentences), where a significance test is important because a 0.5 chrF difference is sometimes not statistically significant.
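sacreBLEU (since 2.0) has paired significance tests built in, so this could slot into the existing evaluation step. A minimal sketch, assuming hypothetical file names and treating the bergamot output as the baseline (the first system passed to -i):

```bash
# Paired bootstrap resampling over BLEU and chrF.
# The first system after -i is the baseline; the report includes a p-value
# for each other system and each metric. --paired-ar switches to
# approximate randomization instead of bootstrap resampling.
sacrebleu flores-dev.ref.en \
  -i flores-dev.bergamot.en flores-dev.google.en flores-dev.microsoft.en \
  -m bleu chrf \
  --paired-bs --paired-bs-n 1000
```

This would give a direct answer to whether a 0.5 chrF gap on a ~1,000-sentence test set is significant, rather than eyeballing it.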

eu9ene commented 1 week ago

We already run comet-compare in the evals repo, but it hasn't looked very useful so far: our models are usually quite a bit behind Google and Microsoft, so the output almost always looks like this:

wmt20.cs-en

    wmt20.microsoft.en outperforms wmt20.bergamot.en.
    wmt20.google.en outperforms wmt20.bergamot.en.

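(For reference, a comparison like the one above comes from a command along these lines; the file names are hypothetical and the exact comet-compare flags vary between unbabel-comet releases, so treat this as a sketch:)

```bash
# Pairwise COMET comparison with statistical significance (bootstrap
# resampling / paired t-test) between our system and the online engines,
# given the shared source and reference.
comet-compare -s wmt20.src.cs \
  -t wmt20.bergamot.en wmt20.google.en wmt20.microsoft.en \
  -r wmt20.ref.en
```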
We do not run this with SacreBLEU though, so it's something we should add.