Closed SeverinoDaDalt closed 9 months ago
The message is only a warning, so you can safely ignore it. The .vocab file is stored as TSV, so a tab inside a piece breaks that file's format, but this is not a big issue: the .vocab file is not used for tokenization. It is just a human-readable reference.
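To illustrate why the trainer warns, here is a minimal sketch of the TSV problem using only the standard library. It assumes each .vocab line is a piece, a tab, and a score, and that the piece is written verbatim; the helper name and the scores are made up for illustration.

```python
def write_vocab_line(piece, score):
    # Hypothetical helper: one .vocab-style line is "<piece>\t<score>".
    return f"{piece}\t{score}\n"

# One ordinary piece and one piece made of two tab characters.
text = write_vocab_line("▁hello", -3.5) + write_vocab_line("\t\t", 0.0)

# A naive TSV reader splits every line on tabs.
rows = [line.split("\t") for line in text.splitlines()]
field_counts = [len(row) for row in rows]

# The ordinary line has 2 fields; the tab piece yields 4, so the piece
# can no longer be told apart from the column separators.
print(field_counts)
```

Only a tool that reads the .vocab file back as TSV would trip over this; the .model file is what the tokenizer actually loads.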
In addition, you might need to disable the default normalization or suppress the rules around tab characters, as tab characters are normalized into whitespace by default.
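A rough sketch of what that could look like with the Python module follows. The `user_defined_symbols`, `normalization_rule_name`, and `remove_extra_whitespaces` trainer options exist in SentencePiece; the helper names, file names, vocabulary size, and maximum run length are assumptions for illustration, and whether "identity" normalization is appropriate depends on your corpus.

```python
def tab_run_symbols(max_len):
    """Return tab runs of length max_len down to 2, longest first."""
    return ["\t" * n for n in range(max_len, 1, -1)]


def build_trainer_kwargs(corpus_path, model_prefix, vocab_size, max_tab_run=4):
    """Collect trainer options (hypothetical helper, not part of SentencePiece)."""
    return dict(
        input=corpus_path,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        # Real tab characters, not the literal text "<0x09>".
        user_defined_symbols=tab_run_symbols(max_tab_run),
        # "identity" turns off the default rules that map tabs to spaces.
        normalization_rule_name="identity",
        # Keep runs of whitespace as they appear in the input.
        remove_extra_whitespaces=False,
    )


def train(corpus_path, model_prefix, vocab_size):
    # Imported lazily so the helpers above work without the package installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        **build_trainer_kwargs(corpus_path, model_prefix, vocab_size)
    )
```

Calling `train("corpus.txt", "tabs", 8000)` (placeholder arguments) should still log the warning about the .vocab format, which can be ignored as described above.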
Hi, my issue is simple. I want to add sequences of the tab byte ('<0x09>') as user_defined_tokens; for example, a sequence of length 2 would be '<0x09><0x09>'. I tried the following:
user_defined_token. It obviously does not work since it is assuming they are multiple chars, not bytes.
user_defined_token. It works but a warning is raised: trainer_interface.cc(706) LOG(WARNING) The piece [ ] contains escaped characters that break the format of <MY_PATH>
What is the correct way to go?
P.S.: I am using the Python module.