What's the difference between setting "--tokenize" to "flores101" and setting it to "flores200"?

mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons

Apache License 2.0


What's the difference between setting "--tokenize" to "flores101" and setting it to "flores200"? #243

Closed Phuoc-Hoan-Le closed 10 months ago

Phuoc-Hoan-Le commented 10 months ago

What's the difference between setting "--tokenize" to "flores101" and setting it to "flores200"? Are they the same? Both use the SentencePiece tokenizer, right?

If not, when is it appropriate to use "flores101" rather than "flores200" when testing on datasets other than FLORES-101/200?

martinpopel commented 10 months ago

Both use SentencePiece, but they are different models, trained on different datasets (Flores101 and Flores200). See https://github.com/mjpost/sacrebleu/blob/bafca9df5b933770aa5c7fb7d858acd34ac43d4c/sacrebleu/tokenizers/tokenizer_spm.py#L20-L27

Both tokenizers can be used for any dataset and language, but for languages that were not included in the training data of a given tokenizer, we cannot expect reasonable tokenization.
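
To see how much the choice matters for a given dataset, one option is to score the same output with both tokenizers and compare. The following is only a rough sketch: `bleu_by_tokenizer` and its `score_fn` parameter are hypothetical names made up for this example, and the default path assumes a sacrebleu version that ships both tokenizers, with the `sentencepiece` package installed. As far as I understand, the SentencePiece model is fetched on first use, so that path also needs network access.

```python
from typing import Callable, Dict, Optional, Sequence


def bleu_by_tokenizer(
    hyps: Sequence[str],
    refs: Sequence[str],
    tokenizers: Sequence[str] = ("flores101", "flores200"),
    score_fn: Optional[Callable[[Sequence[str], Sequence[str], str], float]] = None,
) -> Dict[str, float]:
    """Score one system output under several tokenizers (hypothetical helper).

    `score_fn` is an assumption of this sketch: it lets you plug in any
    scorer (or a stub). When omitted, sacrebleu's corpus_bleu is used.
    """
    if score_fn is None:
        # Imported lazily so the helper can be used with a custom scorer
        # even when sacrebleu is not installed.
        import sacrebleu

        def score_fn(h: Sequence[str], r: Sequence[str], tok: str) -> float:
            # corpus_bleu expects a list of reference streams, hence [r].
            return sacrebleu.corpus_bleu(list(h), [list(r)], tokenize=tok).score

    # One score per tokenizer name, in the order given.
    return {tok: score_fn(hyps, refs, tok) for tok in tokenizers}
```

Calling `bleu_by_tokenizer(hyps, refs)` with your own hypothesis and reference lists returns one BLEU score per tokenizer name. Scores produced with different tokenizers are not comparable with each other, so the point is only to check how sensitive your numbers are to the choice, not to pick the larger one.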