@DoctorDream The following tokenizers can be used from sacrebleu.tokenizers:
_TOKENIZERS = {
'none': 'tokenizer_base.BaseTokenizer',
'zh': 'tokenizer_zh.TokenizerZh',
'13a': 'tokenizer_13a.Tokenizer13a',
'intl': 'tokenizer_intl.TokenizerV14International',
'char': 'tokenizer_char.TokenizerChar',
'ja-mecab': 'tokenizer_ja_mecab.TokenizerJaMecab',
'spm': 'tokenizer_spm.TokenizerSPM',
}
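For instance (a minimal sketch, not from this thread itself; the example sentences are placeholders), these names correspond to the values accepted by the tokenize argument of sacrebleu's BLEU API:
import sacrebleu

hyps = ['今天是个好天气']    # system outputs
refs = [['今天天气很好']]    # one inner list per reference set

# 'zh' selects TokenizerZh; any key from _TOKENIZERS above should work,
# provided the installed sacrebleu version ships that tokenizer.
score = sacrebleu.corpus_bleu(hyps, refs, tokenize='zh')
print(score.score)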
e.g. for Chinese, character-level segmentation is supported in sacrebleu:
import sacrebleu.tokenizers.tokenizer_zh as tok
a = tok.TokenizerZh()
mandarin_text = '今天是个好天气'
tokens = a(mandarin_text)
print(tokens)
Output: 今 天 是 个 好 天 气
e.g. for Japanese, the mecab tokenizer is supported:
import sacrebleu.tokenizers.tokenizer_ja_mecab as tok
a = tok.TokenizerJaMecab()
japanese_text = 'こんにちは、世界!'
tokens = a(japanese_text)
print(tokens)
Output: こんにちは 、 世界 !
Sentencepiece (SPM) may be what you are looking for; spm support is already in master but not in release 2.0 of sacrebleu. If a tokenizer is all you are interested in, you can use it directly from https://github.com/google/sentencepiece. You can train an spm model on your multilingual data and then use it to tokenize your text. You can read the docs to gauge whether it meets your needs. sacrebleu is ideally used just for computing BLEU scores. You need the sentencepiece library to train spm models anyway.
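As a rough sketch (the file names, model prefix, and vocab size below are illustrative placeholders, not values from this thread), training and applying an spm model with the sentencepiece library looks roughly like this:
import sentencepiece as spm

# Train a subword model on your own multilingual text file (one sentence per line).
spm.SentencePieceTrainer.train(
    input='corpus.txt', model_prefix='mytok', vocab_size=8000)

# Load the trained model and split text into subword pieces.
sp = spm.SentencePieceProcessor(model_file='mytok.model')
pieces = sp.encode('今天是个好天气', out_type=str)
print(pieces)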
I agree with @sukuya: do not use the sacrebleu tokenizer for any real-world data; its use is mostly legacy. For evaluation, our version 2.1 release will enable the -tok spm option. You can get that now by building from master manually.
If you want real-world "linguistic" tokenizers, I would look into huggingface or spacy, which have lots of options.
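For example (a hedged sketch; 'bert-base-multilingual-cased' is just one multilingual model and any other tokenizer from the huggingface hub could be substituted), counting tokens across languages with a huggingface tokenizer looks like this:
from transformers import AutoTokenizer

# Load a multilingual subword tokenizer (example model only).
tok = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

for text in ['今天是个好天气', 'What a nice day!']:
    tokens = tok.tokenize(text)
    print(len(tokens), tokens)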
Closing this since it seems answered. Feel free to reopen if there are more questions.
I am trying to calculate BLEU scores for Bangla text. How can I use a custom sentencepiece tokenizer with spm?
I am doing research on multilingual generation, and it seems that the tokenizer used by sacrebleu can only split a sentence into separate words by spaces, like: '今天是个好天气' -> '今 天 是 个 好 天 气' and 'What a nice day!' -> 'What a nice day!'. Is my understanding correct?
My current task is to find an independent multilingual tokenizer so that I can limit the length of my sentences in different languages by token count. If the tokenizer used by sacrebleu just splits sentences into 'char-level' words, then I can't use it. Do you have any recommendations for linguistic tokenizers? Thanks a lot!