This PR is on top of #61 and I will rebase to simplify the diff once that is merged. Please ignore the "replace"-related code here.
The changes in this PR support an implementation of BPE tokenizers. It adds a bunch of functionality to the `Trie` and a `BPEMerges` data structure, which is a thin wrapper on top of a map of maps. There are no backwards-incompatible changes anywhere except `read_trie_from_spm`, which now doesn't change the space character and results in tokenizations closer to SPM even when not using BPE.
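For illustration, a map-of-maps wrapper along those lines could look like the sketch below. This is a hedged sketch, not the PR's actual class; the method names and the `(rank, merged id)` payload are assumptions, only "thin wrapper on top of a map of maps" comes from the description above.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <utility>

// Hypothetical sketch of a BPEMerges-style wrapper:
// left id -> right id -> (merge rank, merged token id).
class BPEMerges {
 public:
  void add(int64_t left, int64_t right, int64_t rank, int64_t merged) {
    merges_[left][right] = {rank, merged};
  }

  // Returns (rank, merged id) if the pair can be merged, nullopt otherwise.
  std::optional<std::pair<int64_t, int64_t>> can_merge(int64_t left,
                                                       int64_t right) const {
    auto it = merges_.find(left);
    if (it == merges_.end()) return std::nullopt;
    auto jt = it->second.find(right);
    if (jt == it->second.end()) return std::nullopt;
    return jt->second;
  }

 private:
  std::unordered_map<int64_t,
                     std::unordered_map<int64_t, std::pair<int64_t, int64_t>>>
      merges_;
};
```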
## Trie
- Most things now work with iterators internally, which removes a bunch of `std::vector<char>` creations and copies.
- Added `search_longest_prefix`, which the `Trie` is a perfect fit for.
- Added the ability to set the id when inserting (a short sketch of both additions follows this list).
- Changed the vector that holds the keys to an `unordered_map` to support the above.
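To make the two additions concrete, here is a minimal, self-contained sketch of a trie supporting insert-with-an-explicit-id and iterator-based longest-prefix search. It is illustrative only and does not mirror the PR's actual `Trie` class.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>

// Illustrative trie node; id == -1 means no key ends at this node.
struct TrieNode {
  std::unordered_map<char, std::unique_ptr<TrieNode>> children;
  int64_t id = -1;
};

struct Trie {
  TrieNode root;

  // Insert the key [begin, end) with a caller-chosen id
  // (e.g. to match an existing SPM vocabulary).
  template <typename It>
  void insert(It begin, It end, int64_t id) {
    TrieNode* node = &root;
    for (It it = begin; it != end; ++it) {
      auto& child = node->children[*it];
      if (!child) child = std::make_unique<TrieNode>();
      node = child.get();
    }
    node->id = id;
  }

  // Walk along [begin, end) and return the id of the longest inserted
  // key that is a prefix of the query, or -1 if none is.
  template <typename It>
  int64_t search_longest_prefix(It begin, It end) const {
    const TrieNode* node = &root;
    int64_t best = node->id;
    for (It it = begin; it != end; ++it) {
      auto child = node->children.find(*it);
      if (child == node->children.end()) break;
      node = child->second.get();
      if (node->id >= 0) best = node->id;
    }
    return best;
  }
};
```

Longest-prefix search is the natural primitive for greedily splitting input into the longest known vocabulary pieces, which is presumably why the trie is such a good fit here.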
## BPE
`BPETokenizer::tokenize` is probably the most interesting function. It is not the prettiest implementation, but it is pretty fast and beats SPM on my laptop. There is possible room for improvement in lines 135-160, where we search for neighbors with a linear scan (see the sketch below).
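For context, a greedy BPE merge loop typically has the following shape. This is a hedged sketch with assumed names and encoding (left id -> right id -> (rank, merged id)), not the PR's code; the linear scan for the lowest-rank pair is the kind of neighbor search referred to above.

```cpp
#include <cstdint>
#include <limits>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumed merge table: left id -> right id -> (rank, merged id).
using Merges = std::unordered_map<
    int64_t, std::unordered_map<int64_t, std::pair<int64_t, int64_t>>>;

// Repeatedly apply the lowest-rank (earliest learned) merge until none apply.
// prev/next emulate a doubly linked list so each merge is O(1) once found.
std::vector<int64_t> bpe_merge(std::vector<int64_t> symbols,
                               const Merges& merges) {
  int n = static_cast<int>(symbols.size());
  std::vector<int> prev(n), next(n);
  for (int i = 0; i < n; i++) {
    prev[i] = i - 1;
    next[i] = i + 1;
  }

  while (true) {
    // Linear scan over live neighbor pairs for the best (lowest-rank) merge.
    int best = -1;
    int64_t best_rank = std::numeric_limits<int64_t>::max();
    for (int i = 0; i < n; i = next[i]) {
      int j = next[i];
      if (j >= n) break;
      auto it = merges.find(symbols[i]);
      if (it == merges.end()) continue;
      auto jt = it->second.find(symbols[j]);
      if (jt == it->second.end()) continue;
      if (jt->second.first < best_rank) {
        best_rank = jt->second.first;
        best = i;
      }
    }
    if (best < 0) break;  // no applicable merge left

    // Replace the pair with the merged symbol and unlink the right node.
    int right = next[best];
    symbols[best] = merges.at(symbols[best]).at(symbols[right]).second;
    next[best] = next[right];
    if (next[right] < n) prev[next[right]] = best;
  }

  // Collect the surviving symbols in order.
  std::vector<int64_t> out;
  for (int i = 0; i < n; i = next[i]) out.push_back(symbols[i]);
  return out;
}
```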
`read_bpe_from_spm` ironically implements a small BPE in Python to extract the merges from the file.
## TL;DR
This PR implements SPM tokenization, so far with results exactly identical to spm or HF.