Hey! Could you elaborate on "How can I ensure that BBPE tokenizers correctly handle space division when decoding a sequence"?
What is your concern / issue here?
I aim to develop a multilingual tokenizer. However, when processing multilingual text, especially text that lacks space-based segmentation, such as Chinese, the tokenizer occasionally introduces erroneous spaces before certain characters during decoding. If I instead add whitespace splitting in the pre-tokenizer, the tokenizer no longer preserves spaces correctly when decoding generated English text.
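For reference, a minimal sketch of the setup I am describing, assuming the Hugging Face `tokenizers` library with a byte-level BPE model (the corpus and vocab size below are placeholders, not my real training data):

```python
# Sketch: train a byte-level BPE tokenizer without a forced prefix space,
# then round-trip Chinese and English text to check the decoded spaces.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

corpus = ["你好，世界", "Hello world", "多语言 tokenizer 测试"]  # placeholder corpus

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
# add_prefix_space=False: do not insert a leading space before the first word.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
# A matching ByteLevel decoder is needed so byte-level symbols map back to real spaces.
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["<unk>"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

for text in corpus:
    ids = tokenizer.encode(text).ids
    print(repr(text), "->", repr(tokenizer.decode(ids)))
```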
Ok, the extra-space issue is fixed by #1357! You should give it a try!
Thanks
I trained a tokenizer and set 'add_prefix_space' to 'False'. How can I ensure that BBPE tokenizers correctly handle space division when decoding a sequence?
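A minimal sketch of how I understand the decoding side, assuming the tokenizer was trained with the `tokenizers` library and saved to a file ("tokenizer.json" here is a placeholder path): pair the ByteLevel pre-tokenizer with the ByteLevel decoder, and if the tokenizer is wrapped in `transformers`, avoid the post-hoc space clean-up.

```python
# Sketch: make decoding space-faithful by using the ByteLevel decoder,
# which maps byte-level symbols (e.g. 'Ġ') back to real spaces instead of
# re-joining tokens with inserted spaces.
from tokenizers import Tokenizer, decoders

tokenizer = Tokenizer.from_file("tokenizer.json")  # placeholder path
tokenizer.decoder = decoders.ByteLevel()

enc = tokenizer.encode("The quick brown fox")
print(repr(tokenizer.decode(enc.ids)))  # internal spaces kept, no extra leading space

# If wrapping the same file in transformers, disabling clean-up avoids
# extra space edits applied after decoding.
from transformers import PreTrainedTokenizerFast

fast = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
ids = fast("The quick brown fox")["input_ids"]
print(repr(fast.decode(ids, clean_up_tokenization_spaces=False)))
```

Is that the intended way to keep space handling consistent between training and decoding?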