Closed: eu9ene closed this 1 year ago
We recently updated COMET from 1.1.3 to 2.1.1, which uses different models. We should recalculate COMET scores for all languages for Bergamot, Microsoft and Google. I suggest implementing #91 first so that we don't have to re-translate everything every time we change metrics.
I opened #107 to test the hypothesis that the score change is due to the COMET update.
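To illustrate the idea of not re-translating on every metric change, here is a minimal sketch of caching translations on disk so that only the scoring step is rerun. It is not the implementation proposed in #91; the function name `translate_cached`, the cache file layout, and the `translate_fn` callback are all hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, List


def translate_cached(
    cache_dir: Path,
    translator: str,
    pair: str,
    sources: List[str],
    translate_fn: Callable[[List[str]], List[str]],
) -> List[str]:
    """Return translations, calling translate_fn only on a cache miss.

    Hypothetical layout: one JSON file per (translator, language pair).
    """
    cache_file = cache_dir / f"{translator}.{pair}.json"
    if cache_file.exists():
        cached = json.loads(cache_file.read_text(encoding="utf-8"))
        # Reuse the cache only if it was built from the same source sentences.
        if cached["sources"] == sources:
            return cached["translations"]

    translations = translate_fn(sources)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(
        json.dumps(
            {"sources": sources, "translations": translations},
            ensure_ascii=False,
        ),
        encoding="utf-8",
    )
    return translations
```

With something like this in place, switching the COMET version would only require rerunning the metric over the cached translations.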
Yes, we can see that the recalculated scores are indeed much higher than before.
For some translators, BLEU and COMET move in opposite directions relative to Bergamot. This mostly affects the open-source translators.
For example, for cs-en we can see:
BLEU: argos -33%, nllb -27%
COMET: argos +47%, nllb +51%
There might be a bug somewhere.
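A quick way to spot where the two metrics disagree is to compare the signs of their differences against Bergamot. The sketch below uses the cs-en figures quoted above and assumes they are relative differences in percent; the helper names `relative_change` and `metrics_disagree` are hypothetical and not part of the evaluation code.

```python
def relative_change(score: float, baseline: float) -> float:
    """Percentage difference of score against baseline (assumed definition)."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    # abs() keeps the sign meaningful if the baseline score is negative.
    return (score - baseline) / abs(baseline) * 100.0


def metrics_disagree(bleu_delta: float, comet_delta: float) -> bool:
    """True when one metric improves while the other gets worse."""
    return bleu_delta * comet_delta < 0


# cs-en differences against Bergamot, as quoted in this thread (percent).
deltas = {
    "argos": {"bleu": -33.0, "comet": 47.0},
    "nllb": {"bleu": -27.0, "comet": 51.0},
}

flagged = sorted(
    name
    for name, d in deltas.items()
    if metrics_disagree(d["bleu"], d["comet"])
)
print(flagged)  # ['argos', 'nllb']
```

Running the same check over every language pair would show whether the disagreement is limited to a few translators or appears across the board, which could help narrow down a possible bug.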
BLEU prod:
COMET prod: