google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Sequence of byte '<0x09>' as token #982

Closed SeverinoDaDalt closed 9 months ago

SeverinoDaDalt commented 9 months ago

Hi, my issue is simple. I want to add sequences of the tab byte ('<0x09>') as user_defined_tokens; for example, a sequence of length 2 would be '<0x09><0x09>'. I tried the following:

What is the correct way to go?

P.S.: I am using the Python module.

taku910 commented 9 months ago

The error message is just a warning, so you can simply ignore it. The .vocab file is stored as TSV, so a tab inside a vocab entry breaks the file's format. That is not a big issue, though, because the .vocab file is not used during tokenization. It is only a human-readable reference.
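To illustrate the TSV point, here is a minimal sketch using only the standard library. It assumes each .vocab line has the layout `piece<TAB>score`. The helper names `parse_vocab_line` and `parse_vocab_line_from_right` are hypothetical and not part of sentencepiece.

```python
def parse_vocab_line(line):
    """Naively split a `piece<TAB>score` line on every tab."""
    return line.rstrip("\n").split("\t")


def parse_vocab_line_from_right(line):
    """Split only on the last tab, so tabs inside the piece survive."""
    return line.rstrip("\n").rsplit("\t", 1)


# An ordinary entry: one piece, one score.
ordinary = parse_vocab_line("\u2581hello\t-3.5\n")

# A hypothetical entry whose piece is two literal tabs, with score 0.
tab_line = "\t\t" + "\t" + "0" + "\n"
broken = parse_vocab_line(tab_line)
recovered = parse_vocab_line_from_right(tab_line)

print(len(ordinary))   # 2 fields, as a TSV reader expects
print(len(broken))     # 4 fields, so a naive TSV reader misreads the entry
print(recovered)       # the two-tab piece and its score
```

Any tool that reads the .vocab file as plain TSV will see extra empty columns for such an entry. The trained model itself is unaffected.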

In addition, you might need to disable the default normalization or suppress the rules around tab characters, because tab characters are normalized into whitespace by default.
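A rough sketch of what the training options might look like under that advice. It assumes the tokens are literal tab characters (not the string '<0x09>'), uses the trainer options `user_defined_symbols` and `normalization_rule_name="identity"`, and treats the corpus path, model prefix, vocabulary size, and the helper `build_trainer_kwargs` as placeholders made up for illustration. Whether this gives exactly the desired segmentation has not been verified here.

```python
def build_trainer_kwargs(corpus_path, model_prefix, max_tabs=4):
    """Collect trainer options for tab runs of length 2..max_tabs.

    Hypothetical helper: it only builds a dict and does not call sentencepiece.
    """
    tab_runs = ["\t" * n for n in range(2, max_tabs + 1)]
    return {
        "input": corpus_path,                   # placeholder path
        "model_prefix": model_prefix,           # placeholder prefix
        "vocab_size": 8000,                     # arbitrary illustrative value
        "user_defined_symbols": tab_runs,       # literal tab sequences
        "normalization_rule_name": "identity",  # skip the default normalization
    }


kwargs = build_trainer_kwargs("corpus.txt", "tab_model")
print(kwargs["user_defined_symbols"])

# With sentencepiece installed, the options would be passed like this:
# import sentencepiece as spm
# spm.SentencePieceTrainer.train(**kwargs)
```

After training, encoding a short string that contains tabs is a quick way to check whether the tab runs come out as single pieces.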