Description:
When using the SentencePieceUnigramTokenizer with a custom vocabulary, there is no attribute to handle the unk_id, which causes errors when encoding text that is not present in the vocabulary.

Example vocabulary:
{'a': -1.23, 'b': -1.34, 'c': -1.45}

Issue:
The current implementation does not handle the unk_id, leading to errors when it encounters unknown tokens.

Suggested Fix:
Adding support for an unk_id parameter in the tokenizer initialization would resolve this issue.

Please let me know if there is an existing solution to this.
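To illustrate the behavior being requested, here is a minimal toy sketch (plain Python, not the tokenizers API; the helper names build_vocab and encode are made up for this example). It uses the example vocabulary above and maps symbols missing from the vocabulary to a reserved unk_id instead of raising an error:

```python
UNK_ID = 0

def build_vocab(scores):
    # Reserve id 0 for <unk>; remaining ids follow insertion order.
    vocab = {"<unk>": UNK_ID}
    for token in scores:
        vocab[token] = len(vocab)
    return vocab

def encode(text, vocab, unk_id=UNK_ID):
    # Per-character lookup. A real unigram tokenizer would search for the
    # highest-scoring segmentation; the point here is only the unk fallback:
    # unknown symbols map to unk_id rather than raising a KeyError.
    return [vocab.get(ch, unk_id) for ch in text]

scores = {'a': -1.23, 'b': -1.34, 'c': -1.45}
vocab = build_vocab(scores)
print(encode("abcx", vocab))  # 'x' is not in the vocab, so it becomes UNK_ID
```

With support for an unk_id in the real tokenizer's initialization, encoding out-of-vocabulary text would degrade gracefully in the same way instead of failing.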