Fix an edge case tokenizing with ignore unknown

Previously, the following code failed to tokenize the second string even though ignore_unk is set to True.

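import mlx.data.core
from mlx.data.core import CharTrie

if __name__ == "__main__":
    # Vocabulary with two tokens; a lone " " is only a prefix of " world", not a token.
    vocab = CharTrie()
    vocab.insert("hello")
    vocab.insert(" world")

    tokenizer = mlx.data.core.Tokenizer(vocab, ignore_unk=True)
    # Every character is covered by a vocabulary token.
    print(tokenizer.tokenize_shortest("hello world"))
    # The trailing space matches no complete token; this call used to fail.
    print(tokenizer.tokenize_shortest("hello "))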
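
To make the edge case concrete, here is a minimal pure-Python sketch of what ignoring unknown input means when a string ends partway through a token. It is only an illustration and not mlx-data's implementation: build_trie and tokenize are hypothetical helpers, the trie is a plain nested dict, and the matching is greedy longest-match rather than the shortest tokenization that tokenize_shortest computes.

```python
def build_trie(tokens):
    """Build a nested-dict trie; the key None marks the end of a token and stores its id."""
    root = {}
    for token_id, token in enumerate(tokens):
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node[None] = token_id
    return root


def tokenize(trie, text, ignore_unk=False):
    """Greedy longest-match tokenization over the toy trie (an assumption, see above)."""
    ids = []
    pos = 0
    while pos < len(text):
        node = trie
        best_id, best_end = None, pos
        i = pos
        # Walk the trie as far as the text allows, remembering the last complete token.
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            if None in node:
                best_id, best_end = node[None], i
        if best_id is not None:
            ids.append(best_id)
            pos = best_end
        elif ignore_unk:
            # No complete token starts here: skip one character instead of failing.
            pos += 1
        else:
            raise ValueError(f"no token matches at position {pos}")
    return ids


trie = build_trie(["hello", " world"])
print(tokenize(trie, "hello world", ignore_unk=True))  # [0, 1]
print(tokenize(trie, "hello ", ignore_unk=True))       # [0]
```

In this sketch the trailing space of "hello " enters the trie (it is a prefix of " world") but the input ends before a complete token is reached, so with ignore_unk=True it is skipped and with ignore_unk=False an error is raised.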

ml-explore/mlx-data #60