Closed. ryonakamura closed this issue 3 years ago.
This is expected behavior. Please see https://github.com/google/sentencepiece/issues/215
<s>, <pad>, </s>, and <unk> are defined as 'control symbols' that should not appear in the input.
We can work around this restriction by defining them as 'user defined symbols'. However, we don't recommend it, especially in user-facing products, because users can alter the model's behavior simply by injecting these tokens into the input.
--user_defined_symbols='<s>,</s>,<pad>,<unk>'
Hi, do you have any idea whether I could add user_defined_symbols to pretrained sp tokenizers? Thank you in advance! @taku910
Token to id conversion works fine, but text to ids conversion fails.
output:
Other special tokens fail to tokenize as well.
output:
Maybe I need to add extra_options? We set sp.SetEncodeExtraOptions("bos:eos"), which is probably a function that automatically adds <s> and </s> to the tokenized text, so the tokenization of literal <s> and </s> in the input text still fails. Also, <pad> and <unk> are not supported.
We cannot add </s> etc. to user_defined_symbols now, since we have already trained a large model.
We are currently handling this with ad-hoc post-processing of the tokenized text.
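One possible shape for such an ad-hoc workaround, sketched here as an assumption rather than as anything the thread's participants actually used, is to split the raw text on the literal special-token strings before encoding, encode only the ordinary chunks, and splice in the special ids directly. The helper name encode_with_specials, the toy tokenizer, and the id values are all hypothetical and exist only for illustration.

```python
import re
from typing import Callable, Dict, List


def encode_with_specials(
    text: str,
    encode: Callable[[str], List[int]],
    special_ids: Dict[str, int],
) -> List[int]:
    """Encode text, mapping literal special-token strings straight to their ids.

    Hypothetical helper: `encode` is any text-to-ids function and
    `special_ids` maps special-token strings to their vocabulary ids.
    """
    if not special_ids:
        return encode(text)
    # Longest tokens first, so a shorter token never shadows a longer one.
    alternatives = sorted(special_ids, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in alternatives) + ")"
    ids: List[int] = []
    # With a capturing group, re.split keeps the matched tokens in the result.
    for chunk in re.split(pattern, text):
        if not chunk:
            continue  # empty strings appear between adjacent matches
        if chunk in special_ids:
            ids.append(special_ids[chunk])
        else:
            ids.extend(encode(chunk))
    return ids


# Toy stand-in for a real tokenizer: one id per whitespace-separated word.
toy_vocab = {"hello": 10, "world": 11}


def toy_encode(s: str) -> List[int]:
    return [toy_vocab.get(word, 0) for word in s.split()]


# Illustrative ids only; a real model defines its own.
toy_special_ids = {"<s>": 1, "</s>": 2, "<pad>": 3}

print(encode_with_specials("<s>hello world</s><pad>", toy_encode, toy_special_ids))
# prints [1, 10, 11, 2, 3]
```

With a loaded SentencePiece processor sp, encode could be sp.encode and special_ids could be built with sp.piece_to_id, which the report above says works for these tokens. Two caveats: encoding chunks separately may not give the same pieces as encoding the whole string at once, and if "bos:eos" extra options are set, every chunk would get its own <s> and </s>. The maintainer's warning also still applies, since anyone who controls the input text can inject these tokens.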
Our Settings
training:
loading: