mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
Apache License 2.0

[Feature request] CJK tokenizer for char-level tokenized BLEU #171

Open chenyangh opened 2 years ago

chenyangh commented 2 years ago

Hi,

I have recently been working on the WMT'20 EN-JA dataset, and I am wondering if we can add a character-level tokenizer (instead of ja-mecab) to facilitate fair comparison on this task.

Existing literature on EN-JA has used char-level BLEU on the test set [1, 2, 3], following the rules of the WMT'20 competition.

I have attempted to use the zh tokenizer for this purpose. However, it ignores Katakana and Hiragana characters. Our current solution (suggested by @zhengzx-nlp) is to add (u'\u3040', u'\u30ff') to the _UCODE_RANGES in https://github.com/mjpost/sacrebleu/blob/2787185dd0f8d224c72ee5a831d163c2ac711a47/sacrebleu/tokenizers/tokenizer_zh.py#L45
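As a quick standalone illustration (our own snippet, not sacrebleu code) of why that range matters: kana code points live in U+3040–U+30FF, outside the ideograph blocks that a Chinese-only range list covers.

```python
# Standalone check (not sacrebleu code): kana falls outside the CJK
# Unified Ideographs block but inside the proposed U+3040-U+30FF range.
KANA_RANGE = ('\u3040', '\u30ff')        # Hiragana + Katakana (proposed addition)
IDEOGRAPH_RANGE = ('\u4e00', '\u9fff')   # CJK Unified Ideographs (hanzi/kanji)

for ch in 'ひらがなカタカナ漢字':
    in_ideographs = IDEOGRAPH_RANGE[0] <= ch <= IDEOGRAPH_RANGE[1]
    in_kana = KANA_RANGE[0] <= ch <= KANA_RANGE[1]
    print(f'U+{ord(ch):04X} {ch}  ideograph block: {in_ideographs}  kana block: {in_kana}')
```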

I wonder whether we could add a similar feature to the existing tokenizers? I am thinking we could either add a cjk tokenizer or modify the existing zh tokenizer. The former might be better for backward-compatibility reasons.

martinpopel commented 2 years ago

tokenizer_zh separates Chinese characters and then tokenizes the non-Chinese part using tokenizer_13a. So if there is e.g. an English name in a Chinese sentence, each word of the name remains a single token. This was needed for full compatibility with a legacy Chinese-BLEU evaluation (which is also the reason for listing the Chinese _UCODE_RANGES explicitly instead of using Unicode properties and possibly including all CJK characters).
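For example (a rough sketch, assuming the tokenizer class is importable and callable as in recent sacrebleu versions), mixed Chinese/English input behaves roughly like this:

```python
from sacrebleu.tokenizers.tokenizer_zh import TokenizerZh

tok = TokenizerZh()
# Chinese characters are split into single tokens, while the embedded
# English name is left to tokenizer_13a and stays word-level, roughly:
# '谷 歌 翻 译 的 英 文 名 是 Google Translate'
print(tok('谷歌翻译的英文名是 Google Translate'))
```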

The papers you cite say "BLEU scores are character-level." or "charBLEU". I would interpret this as tokenizing all characters, including those in Latin-script names. For this, we already have sacrebleu --tokenize char (tokenizer_char.py).

Of course, there is a question of whether character-level BLEU (limited to character 4-grams, and built on the BLEU algorithm, which focuses on precision with a brevity penalty rather than recall) is suitable enough, i.e. correlates with human evaluation, and why not use e.g. chrF instead (where the default --chrf-char-order is 6).
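For reference, both options are also available from the Python API (a minimal sketch; class names and arguments as in sacrebleu 2.x):

```python
from sacrebleu.metrics import BLEU, CHRF

hyps = ['猫がマットに座った']
refs = [['猫がマットの上に座った']]   # a single reference stream

char_bleu = BLEU(tokenize='char')     # tokenizer_char: split every character, Latin script included
chrf = CHRF(char_order=6)             # same as the default --chrf-char-order

print(char_bleu.corpus_score(hyps, refs))
print(chrf.corpus_score(hyps, refs))
```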

zhengzx-nlp commented 2 years ago

Hi @martinpopel

Thanks for your reply.

In terms of "character-level BLEU" for Chinese and Japanese texts, what we meant is exactly what tokenizer_zh does: separating CJK characters and leaving the remaining non-CJK text to 13a tokenization. This makes sense because the underlying purpose is to avoid the ambiguity introduced by different segmentation tools for Chinese and Japanese.

Our issue is that _UCODE_RANGES lacks coverage of the commonly used full-width Japanese Hiragana and Katakana (e.g., ひらがな and カタカナ), whereas it does include the half-width kana (e.g., ｶﾀｶﾅ; https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/tokenizers/tokenizer_zh.py#L54).

Also, it does not seem to make sense to use tokenizer_char for Japanese/Chinese texts, since Latin-script words would also get split into characters.

Thus, we would like to ask that _UCODE_RANGES be extended with the additional range (u'\u3040', u'\u30ff') [1] to support Hiragana and Katakana.

Many thanks!

Zaixiang


Reference: [1] https://en.wikipedia.org/wiki/Kana#In_Unicode

ozancaglayan commented 2 years ago

It sounds good to me to have a separate tokenizer that extends the ranges with the ones you suggested, so that the current zh tokenizer is left unchanged.
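To make that concrete, here is a rough standalone sketch (not a tested patch against sacrebleu) of what such a tokenizer would do: the same strategy as the zh tokenizer, but with the Hiragana/Katakana block added to the character ranges. In the real implementation the non-CJK remainder would still be passed through tokenizer_13a.

```python
import re

# Illustrative ranges only -- the real _UCODE_RANGES in tokenizer_zh.py is
# longer; the last entry below is the addition proposed in this issue.
_CJK_CHAR_RANGES = (
    ('\u4e00', '\u9fff'),   # CJK Unified Ideographs (hanzi/kanji)
    ('\u3400', '\u4dbf'),   # CJK Unified Ideographs Extension A
    ('\u3040', '\u30ff'),   # Hiragana + Katakana (the proposed addition)
)

def _is_cjk_char(ch: str) -> bool:
    return any(low <= ch <= high for low, high in _CJK_CHAR_RANGES)

def cjk_char_split(line: str) -> str:
    """Surround every CJK/kana character with spaces; in sacrebleu proper,
    the remaining non-CJK text would then go through tokenizer_13a."""
    padded = ''.join(f' {ch} ' if _is_cjk_char(ch) else ch for ch in line)
    return re.sub(r'\s+', ' ', padded).strip()

print(cjk_char_split('カタカナとひらがなと Google 翻訳'))
# -> 'カ タ カ ナ と ひ ら が な と Google 翻 訳'
```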