huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Issue with `SentencePieceUnigramTokenizer` Handling Unknown Tokens #1576

Open Munikumar09 opened 3 months ago

Munikumar09 commented 3 months ago

Description:

When using `SentencePieceUnigramTokenizer` with a custom vocabulary, there is no parameter for setting the `unk_id`, so encoding text that contains tokens not present in the vocabulary raises an error.

Example:
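A minimal reproduction (the exact snippet is not shown here, so this is a sketch using a hypothetical two-token vocab):

from tokenizers import SentencePieceUnigramTokenizer

# Hypothetical custom vocab of (token, log-probability) pairs, with no "<unk>" entry
vocab = [("▁hello", -1.0), ("▁world", -2.0)]
tokenizer = SentencePieceUnigramTokenizer(vocab=vocab)

# Fails: the text contains pieces not in the vocab and no unk_id is configured
tokenizer.encode("goodbye")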

Suggested Fix:

from typing import List, Optional, Tuple

from tokenizers import Regex, Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.implementations import BaseTokenizer
from tokenizers.models import Unigram


class SentencePieceUnigramTokenizer(BaseTokenizer):
    def __init__(
        self,
        vocab: Optional[List[Tuple[str, float]]] = None,
        replacement: str = "▁",
        add_prefix_space: bool = True,
        unk_id: int = 0,
    ):
        if vocab is not None:
            # Forward unk_id so the Unigram model has a fallback id
            # for pieces that are not in the vocab.
            tokenizer = Tokenizer(Unigram(vocab, unk_id=unk_id))
        else:
            tokenizer = Tokenizer(Unigram())

        tokenizer.normalizer = normalizers.Sequence([
            normalizers.Nmt(),
            normalizers.NFKC(),
            normalizers.Replace(Regex(" {2,}"), " "),
        ])
        tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(replacement=replacement)
        tokenizer.decoder = decoders.Metaspace(replacement=replacement)

        parameters = {
            "model": "SentencePieceUnigram",
            "replacement": replacement,
            "add_prefix_space": add_prefix_space,
        }

        super().__init__(tokenizer, parameters)
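With this change, unknown pieces map to `unk_id` instead of erroring. A usage sketch, assuming the vocab's first entry is the unk token:

vocab = [("<unk>", 0.0), ("▁hello", -1.0), ("▁world", -2.0)]
tokenizer = SentencePieceUnigramTokenizer(vocab=vocab, unk_id=0)

# Pieces not in the vocab now resolve to id 0 ("<unk>") instead of raising
print(tokenizer.encode("hello goodbye").ids)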

Issue:

The current implementation does not expose `unk_id`, so the underlying `Unigram` model has no fallback for unknown tokens. Adding `unk_id` support to the tokenizer's initialization would resolve this issue.

Please let me know if there is an existing solution for this.

ArthurZucker commented 2 months ago

Well, in this case the unk token is missing from the vocab that you passed to the Tokenizer. Why not add `"<unk>"` at id 0 in the vocab?
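For example (a sketch of this workaround, with a hypothetical vocab):

from tokenizers import Tokenizer
from tokenizers.models import Unigram

# Put "<unk>" at index 0 and point unk_id at that entry
vocab = [("<unk>", 0.0), ("▁hello", -1.0), ("▁world", -2.0)]
tokenizer = Tokenizer(Unigram(vocab, unk_id=0))

# Pieces not covered by the vocab fall back to id 0 instead of raising
print(tokenizer.encode("▁hello there").ids)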