huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

When decoding an English sentence with the 'add_prefix_space' parameter set to 'False', how can I add spaces? #1362

Closed: enze5088 closed this 8 months ago

enze5088 commented 9 months ago

I trained a tokenizer with 'add_prefix_space' set to 'False'. How can I make sure byte-level BPE (BBPE) tokenizers handle spaces correctly when decoding a sequence?

from tokenizers import normalizers, pre_tokenizers, decoders, processors
from tokenizers.normalizers import NFC, StripAccents
from tokenizers.pre_tokenizers import (
    Whitespace, Punctuation, Digits, UnicodeScripts, ByteLevel)

# Unicode-normalize to NFC and strip accents
tokenizer.normalizer = normalizers.Sequence([NFC(), StripAccents()])
# Split on whitespace, punctuation, digits, and script boundaries,
# then map to byte level without a prefix space
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [Whitespace(), Punctuation(), Digits(individual_digits=True), UnicodeScripts(),
     ByteLevel(add_prefix_space=False, use_regex=True)])
tokenizer.decoder = decoders.ByteLevel(add_prefix_space=False, use_regex=True)
tokenizer.post_processor = processors.ByteLevel()
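
For context, a minimal runnable sketch (toy training data, purely illustrative) of the failure mode: Whitespace() drops the spaces between words during pre-tokenization, and with 'add_prefix_space' set to 'False' the byte-level step never re-encodes them, so the decoder cannot restore them:

from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders
from tokenizers.pre_tokenizers import Whitespace, ByteLevel

tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.Sequence(
    [Whitespace(), ByteLevel(add_prefix_space=False, use_regex=True)])
tok.decoder = decoders.ByteLevel()

# Toy corpus, only to make encode/decode runnable
tok.train_from_iterator(["hello world", "hello there"],
                        trainers.BpeTrainer(special_tokens=["<unk>"]))

ids = tok.encode("hello world").ids
print(tok.decode(ids))  # prints "helloworld" - the inter-word space is gone
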
ArthurZucker commented 9 months ago

Hey! Could you elaborate on "How can I ensure that BBPE tokenizers correctly handle space division when decoding a sequence"? What is your concern / issue here?

enze5088 commented 9 months ago

I aim to develop a multilingual tokenizer. However, when processing multilingual text, especially text without space-based segmentation, such as Chinese, it occasionally introduces erroneous spaces before certain characters. And if I add Whitespace() to the pre-tokenizer, the tokenizer does not preserve the spaces correctly when decoding generated English text. See the illustration below.
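
To make this concrete, here is a small illustration (the example sentence is made up) using pre_tokenize_str, which shows the pieces the model sees with the pipeline above; the spaces in the input are consumed by Whitespace() and appear in none of the pieces, so nothing at decode time can tell where they were:

from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import (
    Whitespace, Punctuation, Digits, UnicodeScripts, ByteLevel)

pre = pre_tokenizers.Sequence(
    [Whitespace(), Punctuation(), Digits(individual_digits=True),
     UnicodeScripts(), ByteLevel(add_prefix_space=False, use_regex=True)])

# The spaces in "I like 你好" never appear in any piece below
for piece, span in pre.pre_tokenize_str("I like 你好"):
    print(piece, span)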

ArthurZucker commented 9 months ago

Ok, the insertion of additional spaces is fixed by #1357! You should give it a try!
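
A quick way to try it (a sketch, assuming an upgraded tokenizers build that includes the fix, plus the tokenizer configured earlier in this thread):

import tokenizers
print(tokenizers.__version__)  # confirm which build is installed

# Round-trip a mixed-script string; with the fix, decoding should not
# insert spaces before the Chinese characters
ids = tokenizer.encode("hello 你好 world").ids
print(repr(tokenizer.decode(ids)))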

enze5088 commented 8 months ago

> Ok, the insertion of additional spaces is fixed by #1357! You should give it a try!

Thanks