mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
Apache License 2.0

Problems with the tokenizers #192

Closed DoctorDream closed 2 years ago

DoctorDream commented 2 years ago

I am doing research on multilingual generation, and it seems that the tokenizer used by sacrebleu can only split a sentence into separate words by spaces, e.g.: '今天是个好天气' -> '今 天 是 个 好 天 气', 'What a nice day!' -> 'What a nice day!'. Is my understanding correct?

My current task is to find an independent multilingual tokenizer so that I can limit the length of my sentences in different languages by token count. If the tokenizer used by sacrebleu just splits sentences into character-level 'words', then I can't use it. Do you have any recommendations for linguistic tokenizers? Thanks a lot!

sukuya commented 2 years ago

@DoctorDream The following tokenizers can be used from sacrebleu.tokenizers:

_TOKENIZERS = {
    'none': 'tokenizer_base.BaseTokenizer',
    'zh': 'tokenizer_zh.TokenizerZh',
    '13a': 'tokenizer_13a.Tokenizer13a',
    'intl': 'tokenizer_intl.TokenizerV14International',
    'char': 'tokenizer_char.TokenizerChar',
    'ja-mecab': 'tokenizer_ja_mecab.TokenizerJaMecab',
    'spm': 'tokenizer_spm.TokenizerSPM',
}
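
These keys map to the tokenize argument of the metric API (and the -tok flag on the command line). A minimal sketch of selecting one by name, assuming sacrebleu 2.x; the sentences are illustrative placeholders:

from sacrebleu.metrics import BLEU

# Pick a tokenizer from the table above by its key, e.g. 'zh' for Chinese.
bleu = BLEU(tokenize='zh')
score = bleu.corpus_score(['今天是个好天气'], [['今天天气很好']])
print(score)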

e.g., for Chinese, character-level segmentation is supported in sacrebleu:

import sacrebleu.tokenizers.tokenizer_zh as tok

# Segment Chinese text at the character level.
tokenizer = tok.TokenizerZh()
mandarin_text = '今天是个好天气'
tokens = tokenizer(mandarin_text)
print(tokens)

Output: 今 天 是 个 好 天 气

e.g., for Japanese, the MeCab tokenizer is supported:

import sacrebleu.tokenizers.tokenizer_ja_mecab as tok

# MeCab performs morphological word segmentation (requires the mecab-python3 package).
tokenizer = tok.TokenizerJaMecab()
japanese_text = 'こんにちは、世界!'
tokens = tokenizer(japanese_text)
print(tokens)

Output: こんにちは 、 世界 !

SentencePiece (SPM) may be what you are looking for; SPM support is already in master but not in the 2.0 release of sacrebleu. If a tokenizer is all you need, you can use SentencePiece directly from https://github.com/google/sentencepiece: train an SPM model on your multilingual data and then use it to tokenize your text. The docs should help you gauge whether it meets your needs. sacrebleu is ideally used just for computing BLEU scores, and you need the SentencePiece library to train SPM models anyway.
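
A minimal sketch of that train-then-tokenize workflow, assuming the sentencepiece package is installed; the file names, vocabulary size, and sample sentences are illustrative:

import sentencepiece as spm

# Train a model on a mixed-language corpus, one sentence per line.
spm.SentencePieceTrainer.train(
    input='multilingual_corpus.txt',
    model_prefix='multiling',
    vocab_size=32000,
    character_coverage=0.9995,  # high coverage helps CJK scripts
)

# Load the model and tokenize; the number of ids gives a language-independent length.
sp = spm.SentencePieceProcessor(model_file='multiling.model')
print(sp.encode('今天是个好天气', out_type=str))
print(len(sp.encode('What a nice day!')))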

mjpost commented 2 years ago

I agree with @sukuya: do not use the sacrebleu tokenizers on any real-world data; their use is mostly legacy. For evaluation, our 2.1 release will enable the -tok spm option. You can get that now by building manually from the repository.
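
Once a build with SPM support is installed, the option can also be selected through the Python API; a sketch with placeholder data:

import sacrebleu

# Placeholder system output and reference; tokenize='spm' needs a post-2.0 build.
hypotheses = ['What a nice day!']
references = [['What a lovely day!']]
bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize='spm')
print(bleu.score)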

If you want real-world "linguistic" tokenizers, I would look into Hugging Face or spaCy, which offer lots of options.

mjpost commented 2 years ago

Closing this since it seems answered. Feel free to reopen if there are more questions.

rabeya-akter commented 11 months ago

I am trying to calculate the BLEU score for Bangla text. How can I use a custom SentencePiece tokenizer with spm?