eu9ene opened 3 months ago
See also #229.
Maybe also:
- `comet-compare`
- for chrF, `sacrebleu --paired-ar`
EDIT: this will help a lot when a couple of models have very close metric values. It also matters for short test sets like Flores (only ~1000 sentences), where a 0.5 chrF difference is sometimes not statistically significant.
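For reference, a minimal sketch of how the `comet-compare` paired significance test could be invoked, assuming one-sentence-per-line files with hypothetical names (flags follow the Unbabel COMET CLI docs; adjust for the installed version):

```sh
# Paired significance test between two systems with COMET.
# File names are placeholders: source, two hypotheses, and a reference.
comet-compare -s wmt20.cs-en.src.cs \
              -t wmt20.bergamot.en wmt20.google.en \
              -r wmt20.cs-en.ref.en
```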
We already run `comet-compare` in the evals repo, but it doesn't look very useful so far because our models are usually quite a bit behind Google and Microsoft, so the output almost always looks like this:
```
wmt20.cs-en
wmt20.microsoft.en outperforms wmt20.bergamot.en.
wmt20.google.en outperforms wmt20.bergamot.en.
```
We don't run significance tests with SacreBLEU, though, so that's something we should add; see the sketch below.
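A sketch of what that could look like using SacreBLEU's built-in paired tests (flags per the sacrebleu 2.x significance-testing docs; file names are placeholders, and the first system passed to `-i` is treated as the baseline):

```sh
# Paired approximate randomization over BLEU and chrF.
# sacrebleu also supports paired bootstrap resampling via --paired-bs.
sacrebleu wmt20.cs-en.ref.en \
  -i wmt20.bergamot.en wmt20.google.en \
  -m bleu chrf \
  --paired-ar
```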
This is a meta issue to brainstorm ideas on how to make our final automatic quality evaluation procedure more robust and suitable for decision-making. This currently happens in the firefox-translations-models repo but will be migrated to the main repo and W&B.
See this doc.
Some ideas:
- Recommendations from the paper *Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies*
We can add a task list here after migrating firefox-translations-models.