I am trying to compute chrF++ for a set of predictions and references. When I use the sacrebleu CLI (sacrebleu ref.eng_Latn.tok < pred.eng_Latn.tok -m bleu chrf --chrf-word-order 2), the score differs significantly from what I get in Python with CHRF(word_order=2).corpus_score(preds, refs). I have double-checked the data in both cases, and it is correct and identical, so that is not the issue. Is there any reason why this is happening? Similarly, the BLEU scores from BLEU().corpus_score(preds, refs) also vary significantly from the CLI output. Are there some default parameters that I am missing?
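
For anyone trying to reproduce this, here is a minimal sketch of what the Python side of such a comparison might look like. It is not a confirmed diagnosis. The helper names (read_segments, as_reference_streams, score_like_cli) are hypothetical, and it assumes sacrebleu 2.x, where corpus_score takes the references as a list of reference streams (one list per reference set) rather than a flat list of strings. One thing worth ruling out is whether refs was passed in that shape.

```python
def read_segments(path):
    """Read one segment per line, dropping only the trailing newline."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def as_reference_streams(refs):
    """Return refs as a list of reference streams.

    A flat list of strings is treated as a single reference stream and
    wrapped in an outer list; a list of lists is copied as-is.
    """
    if refs and isinstance(refs[0], str):
        return [list(refs)]
    return [list(stream) for stream in refs]


def score_like_cli(preds, refs):
    """Score preds against refs with BLEU and chrF++ (word_order=2)."""
    # Third-party dependency (pip install sacrebleu), imported here so the
    # helpers above can be used without it.
    from sacrebleu.metrics import BLEU, CHRF

    ref_streams = as_reference_streams(refs)
    for stream in ref_streams:
        if len(stream) != len(preds):
            raise ValueError("each reference stream needs one line per prediction")

    bleu = BLEU()
    chrf = CHRF(word_order=2)
    return {
        "bleu": bleu.corpus_score(preds, ref_streams).score,
        "chrf++": chrf.corpus_score(preds, ref_streams).score,
        # Signatures record the settings each metric actually used.
        "bleu_signature": str(bleu.get_signature()),
        "chrf_signature": str(chrf.get_signature()),
    }
```

Calling score_like_cli(read_segments("pred.eng_Latn.tok"), read_segments("ref.eng_Latn.tok")) and comparing the two signature strings with the ones the CLI reports should show whether any default settings differ between the two runs.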