davidweichiang opened 1 year ago
I had similar issues with special Unicode symbols. It may not be suitable for every scenario, but for that reason I used `--tok spm`, since SentencePiece already does NFKC normalization by default. The inconvenience is that BLEU scores tend to be higher (there are more tokens), and you cannot compare them to other BLEU scores computed with the default tokenization unless you recompute those scores with spm as well.
Currently no Unicode normalization (e.g., NFKC or NFKD) is done, so that (say) `für` (with a precomposed `ü`, U+00FC) and `für` (with `u` followed by a combining diaeresis, U+0308) would not count as a match. Would it be possible to add this, or would it break too many things?
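For illustration, the mismatch described above and the effect of applying normalization can be sketched in Python using the standard library's `unicodedata` module:

```python
import unicodedata

# Two visually identical strings that differ at the code-point level:
composed = "f\u00fcr"     # "für" with precomposed U+00FC (NFC form)
decomposed = "fu\u0308r"  # "für" with u + combining diaeresis U+0308 (NFD form)

# Raw string comparison fails, so they would not count as a match
print(composed == decomposed)  # False

# After NFKC normalization both map to the same code points
print(unicodedata.normalize("NFKC", composed)
      == unicodedata.normalize("NFKC", decomposed))  # True
```

This is only a sketch of the underlying problem; applying `unicodedata.normalize` to both references and hypotheses before tokenization would be one way to make such pairs match.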