Sequence of byte '<0x09>' as token

google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

Apache License 2.0


Hi, my issue is simple. I want to add sequences of the tab byte ('<0x09>') as user_defined_tokens; for example, a sequence of length 2 would be '<0x09><0x09>'. I tried the following:

- Passing '<0x09><0x09>' as a user_defined_token. This obviously does not work, since the string is treated as a sequence of multiple characters, not bytes.
- Passing '\t\t' as a user_defined_token. This works, but a warning is raised: trainer_interface.cc(706) LOG(WARNING) The piece [ ] contains escaped characters that break the format of <MY_PATH>
- Passing b'\t\t'. This does not work: when given a sentence containing '\t\t', the tokenizer encodes it as two separate tokens.
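
For reference, the three spellings tried above can be compared as plain Python values. This is a minimal sketch that uses only the standard library and makes no sentencepiece calls, so it says nothing about what the trainer does internally; the variable names and the `describe` helper are hypothetical and chosen for illustration. It only shows that the text '<0x09><0x09>' is twelve ordinary characters, while '\t\t' and b'\t\t' hold the same two bytes under different Python types.

```python
# Three spellings of "two tabs", compared as plain Python values.
# All names here are illustrative; none of them come from sentencepiece.
literal_byte_pieces = "<0x09><0x09>"  # the piece names written out as text
tab_chars = "\t\t"                    # two actual tab characters (str)
tab_bytes = b"\t\t"                   # two tab bytes (bytes, not str)


def describe(value):
    """Return (type name, length, UTF-8 bytes) for a candidate token."""
    raw = value if isinstance(value, bytes) else value.encode("utf-8")
    return type(value).__name__, len(value), raw


for candidate in (literal_byte_pieces, tab_chars, tab_bytes):
    print(describe(candidate))
```

The first candidate is twelve characters long and contains no tab byte at all, which matches the observation that it is read as multiple characters.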

What is the correct way to go?

P.S.: I am using the Python module.

Sequence of byte '<0x09>' as token #982