ml-explore / mlx-data

Efficient framework-agnostic data loading
MIT License
362 stars 40 forks

Fix an edge case tokenizing with ignore unknown #60

Closed: angeloskath closed this 6 months ago

angeloskath commented 6 months ago

Previously, the following code failed to tokenize the second string even though ignore_unk was set to True.

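from mlx.data.core import CharTrie, Tokenizer

if __name__ == "__main__":
    # Build a vocabulary containing exactly two tokens.
    vocab = CharTrie()
    vocab.insert("hello")
    vocab.insert(" world")

    # ignore_unk=True asks the tokenizer to skip input it cannot match.
    tokenizer = Tokenizer(vocab, ignore_unk=True)
    print(tokenizer.tokenize_shortest("hello world"))
    # Before this fix, this call failed: the trailing space is not a token.
    print(tokenizer.tokenize_shortest("hello "))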
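
For readers without mlx-data installed, the edge case can be sketched with a toy pure-Python tokenizer. This is not the library's implementation: build_trie, tokenize_greedy, and the END marker are hypothetical names, and the sketch uses greedy longest-match rather than whatever tokenize_shortest does internally. It also assumes the failure comes from the trailing space being a prefix of " world" without being a complete token, which the description above does not state explicitly.

```python
# Toy illustration only: this is NOT mlx.data's implementation.
END = "<end>"  # hypothetical marker key meaning "a token ends at this node"


def build_trie(tokens):
    """Build a nested-dict trie; each token's id is its insertion index."""
    root = {}
    for token_id, token in enumerate(tokens):
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node[END] = token_id
    return root


def tokenize_greedy(text, trie, ignore_unk=False):
    """Greedy longest-match tokenization (a simplification for illustration)."""
    ids = []
    i = 0
    while i < len(text):
        node = trie
        match_id, match_end = None, i
        j = i
        # Walk the trie as far as the input allows, remembering the
        # longest complete token seen along the way.
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                match_id, match_end = node[END], j
        if match_id is not None:
            ids.append(match_id)
            i = match_end
        elif ignore_unk:
            # A partial match that never completes (such as a trailing " ")
            # is treated as unknown: skip one character and retry.
            i += 1
        else:
            raise ValueError(f"cannot tokenize at position {i}: {text[i:]!r}")
    return ids


if __name__ == "__main__":
    trie = build_trie(["hello", " world"])
    print(tokenize_greedy("hello world", trie, ignore_unk=True))  # [0, 1]
    print(tokenize_greedy("hello ", trie, ignore_unk=True))       # [0]
```

In this sketch the second input walks one step into the " world" branch and then runs out of characters. With ignore_unk=True the dangling space is dropped; with ignore_unk=False the same input raises ValueError.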