`tokenizer_zh` separates Chinese characters and then tokenizes the non-Chinese part using `tokenizer_13a`. So if there is e.g. an English name in a Chinese sentence, each word of the name remains a single token.
This was needed for full compatibility with a legacy Chinese-BLEU evaluation (which is also the reason for listing the Chinese `_UCODE_RANGES` explicitly instead of using Unicode properties and possibly including all CJK characters).
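Roughly, the behaviour is as in this self-contained sketch (not the actual sacrebleu source; the single range shown is only an illustrative subset of `_UCODE_RANGES`, and a plain whitespace split stands in for the 13a pass):

```python
CHINESE_RANGES = [("\u4e00", "\u9fa5")]  # illustrative subset of _UCODE_RANGES

def is_chinese_char(ch):
    return any(lo <= ch <= hi for lo, hi in CHINESE_RANGES)

def zh_like_tokenize(line):
    # Space out characters inside the Chinese ranges; everything else is left
    # for the 13a tokenizer (approximated here by a plain whitespace split).
    spaced = "".join(f" {ch} " if is_chinese_char(ch) else ch for ch in line)
    return " ".join(spaced.split())

print(zh_like_tokenize("我喜欢Harry Potter"))
# -> 我 喜 欢 Harry Potter   (each Chinese character split, the name kept as words)
```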
The papers you cite say "BLEU scores are character-level." or "charBLEU". I would interpret this as tokenizing all characters, including those in Latin-script names. For this, we already have `sacrebleu --tokenize char` (`tokenizer_char.py`).
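In the Python API this corresponds to something like the following sketch (the sentences are placeholders, and `tokenize="char"` assumes a sacrebleu version that ships `tokenizer_char.py`):

```python
import sacrebleu

hyps = ["私はハリー・ポッターが好きです"]        # placeholder system output
refs = [["私はハリー・ポッターが大好きです"]]    # placeholder reference stream

# Character-level BLEU: every character becomes a token, Latin letters included.
bleu = sacrebleu.corpus_bleu(hyps, refs, tokenize="char")
print(bleu.score)
```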
Of course, there is a question whether character-level BLEU (limited to character 4-grams and to the BLEU algorithm's focus on precision with a brevity penalty instead of recall) correlates well enough with human evaluation to be suitable, and why not use e.g. chrF instead (where the default `--chrf-char-order` is 6).
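For comparison, chrF with the default character order could be computed roughly like this (a sketch assuming the v2-style `sacrebleu.metrics` API; sentences are placeholders):

```python
from sacrebleu.metrics import CHRF

hyps = ["私はハリー・ポッターが好きです"]
refs = [["私はハリー・ポッターが大好きです"]]

# chrF with character 6-grams (the default --chrf-char-order).
chrf = CHRF(char_order=6)
print(chrf.corpus_score(hyps, refs))
```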
Hi @martinpopel
Thanks for your reply.
By "character-level BLEU" for Chinese and Japanese texts, we mean exactly what `tokenizer_zh` does: separating CJK characters and leaving the remaining non-CJK text to 13a tokenization. This makes sense, as the underlying purpose is to avoid the ambiguity introduced by different segmentation tools for Chinese and Japanese.
Our issue is that `_UCODE_RANGES` lacks coverage of the commonly used full-width Japanese Hiragana and Katakana (e.g., ひらがな and カタカナ), whereas it does include the half-width kana (e.g., ｶﾀｶﾅ; see https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/tokenizers/tokenizer_zh.py#L54).
Also, it does not seem to make sense to use `tokenizer_char` for Japanese/Chinese texts, since Latin-script words would then get split into characters as well.
Thus we would like to ask for extending `_UCODE_RANGES` with an additional range, (u'\u3040', u'\u30ff') [1], to cover Hiragana and Katakana.
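For concreteness, a minimal sketch of the requested change (the existing entries shown are only an illustrative subset of the real list in tokenizer_zh.py):

```python
_UCODE_RANGES = [
    (u'\u3400', u'\u4db5'),  # CJK Unified Ideographs Extension A (existing entry)
    (u'\u4e00', u'\u9fa5'),  # CJK Unified Ideographs (existing entry)
    # ... other existing ranges ...
    (u'\u3040', u'\u30ff'),  # Hiragana + Katakana (proposed addition)
]
```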
Many thanks!
Zaixiang
Reference: [1] https://en.wikipedia.org/wiki/Kana#In_Unicode
It sounds good to me to have a separate tokenizer that extends the ranges with the ones you suggested, so as not to change the current `zh` tokenizer.
Hi,
I have recently been working on the WMT'20 EN-JA dataset, and I am wondering if we can add a character-level tokenizer (instead of `ja-mecab`) to facilitate fair comparison on this task. Existing literature on EN-JA uses char-level BLEU on the test set [1, 2, 3], following the rules of the WMT'20 competition.

I have attempted to use the `zh` tokenizer for this purpose. However, the script ignores Katakana and Hiragana characters. Our current solution (suggested by @zhengzx-nlp) is to add (u'\u3040', u'\u30ff') to the `_UCODE_RANGES` of https://github.com/mjpost/sacrebleu/blob/2787185dd0f8d224c72ee5a831d163c2ac711a47/sacrebleu/tokenizers/tokenizer_zh.py#L45.

I wonder, can we add a similar feature to the existing tokenizers? I am thinking we can either add a `cjk` tokenizer or modify the existing `zh` tokenizer. The former could be better for backward-compatibility reasons.
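To make the gap concrete, a quick check along these lines (a sketch assuming the module layout linked above; the expected output is inferred from the current ranges, not copied from a run):

```python
from sacrebleu.tokenizers.tokenizer_zh import TokenizerZh

print(TokenizerZh()("日本語のテスト"))
# expected: '日 本 語 のテスト' -- the ideographs get split, but the kana
# 'のテスト' stay glued together because \u3040-\u30ff is not in _UCODE_RANGES
```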