ml-explore / mlx-data

Efficient framework-agnostic data loading
MIT License

Add BPE tokenizers #62

Open angeloskath opened 6 months ago

angeloskath commented 6 months ago

This PR is on top of #61; I will rebase to simplify the diff once that is merged. Please ignore the "replace"-related code here.

This PR adds support for BPE tokenizers. It adds a number of features to the Trie and introduces a BPEMerges data structure, which is a thin wrapper around a map of maps. There are no backwards-incompatible changes except in read_trie_from_spm, which no longer changes the space character; as a result, its tokenizations are closer to SPM's even when BPE is not used.
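To illustrate the "map of maps" idea, here is a plain-Python sketch that stores merge rules in a nested dictionary keyed by the left and then the right symbol, and applies them in priority order. It is not the code in this PR: `build_merges`, `apply_merges`, and the rank-based priority are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a merge table: merges[left][right] -> (rank, merged)
Merges = Dict[str, Dict[str, Tuple[int, str]]]


def build_merges(rules: List[Tuple[str, str]]) -> Merges:
    """Build the nested map; earlier rules get a lower rank (higher priority)."""
    merges: Merges = {}
    for rank, (left, right) in enumerate(rules):
        merges.setdefault(left, {})[right] = (rank, left + right)
    return merges


def apply_merges(symbols: List[str], merges: Merges) -> List[str]:
    """Repeatedly merge the adjacent pair with the lowest rank until none applies."""
    symbols = list(symbols)
    while True:
        best = None  # (rank, index, merged symbol)
        for i in range(len(symbols) - 1):
            hit = merges.get(symbols[i], {}).get(symbols[i + 1])
            if hit is not None and (best is None or hit[0] < best[0]):
                best = (hit[0], i, hit[1])
        if best is None:
            return symbols
        _, i, merged = best
        symbols[i : i + 2] = [merged]


# Toy rules, listed from highest to lowest priority.
merges = build_merges([("l", "o"), ("lo", "w"), ("e", "r")])
tokens = apply_merges(list("lower"), merges)
```

With the nested layout, checking whether an adjacent pair can be merged is two dictionary lookups. The sketch rescans the whole sequence after every merge, which keeps it short but is not how an efficient implementation would do it.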

Trie

BPE

TL;DR

The following implements SPM tokenization and so far produces results identical to those of spm or HF.

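# Load the BPE symbols and merge rules from a SentencePiece model file.
symbols, merges = read_bpe_from_spm("tokenizer.model")
# `ds` is assumed to be an existing dataset with a "text" field.
ds = (
    ds
    # Left-pad dimension 0 with one space character.
    .pad("text", 0, 1, 0, ord(" "))
    # Replace spaces with U+2581, the whitespace marker used by SPM.
    .replace("text", " ", "\u2581")
    .tokenize_bpe("text", symbols, merges)
)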
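For intuition, the pad and replace steps in the pipeline above correspond to the following plain-Python string operations. This only sketches their effect on a single string; `spm_preprocess` is a hypothetical helper and not part of mlx-data.

```python
def spm_preprocess(text: str) -> str:
    """Mimic the pad and replace steps on a plain Python string."""
    padded = " " + text                   # like .pad("text", 0, 1, 0, ord(" "))
    return padded.replace(" ", "\u2581")  # like .replace("text", " ", "\u2581")


example = spm_preprocess("hello world")
```

Here `example` begins with U+2581 and has another U+2581 where the space between the two words was.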