Python -VV
Pip Freeze
Reproduction Steps
Output:
�
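The "�" above is U+FFFD, the Unicode replacement character. With byte-level BPE tokenizers, a single token often corresponds to a partial UTF-8 sequence, so decoding it in isolation produces U+FFFD even though the token is perfectly valid in context. A minimal illustration of the mechanism in plain Python (no tokenizer involved; this is background, not the report's reproduction):

```python
# The first byte of a multi-byte UTF-8 sequence is not valid on its
# own; decoding it with errors="replace" yields U+FFFD, which renders
# as the "�" seen in the output above.
fragment = "é".encode("utf-8")[:1]   # b'\xc3', first byte of a 2-byte sequence
decoded = fragment.decode("utf-8", errors="replace")
print(decoded == "\ufffd")  # True: the byte alone cannot be decoded
```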
Expected Behavior
Looking at the tokenizer.json on the Hugging Face Hub, there are quite a few tokens that decode to unknown and probably shouldn't. Is this expected, or am I making a mistake during decoding?
Additional Context
There are 606 tokens that are output as UNK:
The issue was originally found when using PreTrainedTokenizerFast.
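A count like the 606 above could be reproduced by decoding each vocabulary id individually and counting results containing the replacement character. The sketch below runs on a toy byte-level "vocab" so it is self-contained; with a real tokenizer one would loop over `tokenizer.decode([i])` for each id instead. The helper name and toy vocabulary are illustrative, not from the report:

```python
# Hypothetical sketch: count vocab entries whose standalone decoding
# contains U+FFFD. With transformers this would be
#   sum("\ufffd" in tokenizer.decode([i]) for i in range(tokenizer.vocab_size))
def count_unk_like(decoded_tokens):
    return sum("\ufffd" in t for t in decoded_tokens)

# Toy "vocab": each entry is one raw byte decoded in isolation.
toy_vocab = [bytes([b]).decode("utf-8", errors="replace") for b in range(256)]
print(count_unk_like(toy_vocab))  # 128: every non-ASCII byte is invalid alone
```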
Suggested Solutions
No response