mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
Apache License 2.0
1.06k stars · 162 forks

Working on tokenized pairs? #244

Closed MostHumble closed 11 months ago

MostHumble commented 11 months ago

I'm trying to calculate the BLEU score for a low-resource language, so I'm using a tokenizer that I've trained myself. Is there a way to pass the tokenizer as a param? For now, when I pass the tokenized pairs (lists of tokens), I get the following error:

```
--> 414 self._check_corpus_score_args(hypotheses, references)
    416 # Collect corpus stats
    417 stats = self._extract_corpus_statistics(hypotheses, references)

File /opt/conda/lib/python3.10/site-packages/sacrebleu/metrics/base.py:258, in Metric._check_corpus_score_args(self, hyps, refs)
    255 err_msg = "refs should be a sequence of sequence of strings."
    257 if err_msg:
--> 258     raise TypeError(f'{prefix}: {err_msg}')

TypeError: BLEU: Each element of hyps should be a string.
```

martinpopel commented 11 months ago

Using your own tokenizer is not recommended: it goes against the main idea of SacreBLEU, which is that BLEU scores should be replicable and the evaluation should not depend on external tokenizers.

The default tokenizer `13a` splits basic punctuation tokens (and all space-separated tokens). If your language uses spaces to separate words and has non-ASCII (Unicode) punctuation, you can use `--tokenize intl`. Even if the tokenization is not linguistically adequate for some languages, the BLEU scores still correlate reasonably well with human evaluation. For Chinese, Japanese and Korean, you should use `zh`, `ja-mecab` and `ko-mecab`, respectively (which is the default if you specify these languages via `--language-pair`). For other languages written in scriptio continua, you can use `--tokenize char`, i.e. character-based tokenization. For the 200 languages covered by Flores200, you can use `--tokenize flores200`.

That said, if you really need your own tokenizer (e.g. to compare the difference in BLEU against one of the methods described above), you can pass space-separated tokens (`" ".join(tokens)`) as the input sentences for both the hypotheses and references, together with `--tokenize none`.