This PR is on top of #61 and I will rebase to simplify the diff once that is merged. Please ignore the "replace"-related code here.
The changes in this PR support an implementation of BPE tokenizers. It adds a bunch of functionality to the `Trie` and a `BPEMerges` data structure, which is a thin wrapper on top of a map of maps. There are no backwards-incompatible changes anywhere except `read_trie_from_spm`, which now doesn't change the space character and results in tokenizations closer to SPM even when not using BPE.
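For illustration, a map-of-maps wrapper along those lines could look like the sketch below. This is a hedged sketch, not the PR's actual class; the method names and the `(rank, merged id)` payload are assumptions, only "thin wrapper on top of a map of maps" comes from the description above.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <utility>

// Hypothetical sketch of a BPEMerges-style wrapper:
// left id -> right id -> (merge rank, merged token id).
class BPEMerges {
 public:
  void add(int64_t left, int64_t right, int64_t rank, int64_t merged) {
    merges_[left][right] = {rank, merged};
  }

  // Returns (rank, merged id) if the pair can be merged, nullopt otherwise.
  std::optional<std::pair<int64_t, int64_t>> can_merge(int64_t left,
                                                       int64_t right) const {
    auto it = merges_.find(left);
    if (it == merges_.end()) return std::nullopt;
    auto jt = it->second.find(right);
    if (jt == it->second.end()) return std::nullopt;
    return jt->second;
  }

 private:
  std::unordered_map<int64_t,
                     std::unordered_map<int64_t, std::pair<int64_t, int64_t>>>
      merges_;
};
```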
## Trie
- Most things now work with iterators internally, which removes a bunch of `std::vector<char>` creations and copies.
- Added `search_longest_prefix`, which the `Trie` is a perfect fit for.
- Added the ability to set the id when inserting (a short sketch of both additions follows this list).
- Changed the vector that holds the keys to an `unordered_map` to support the above.
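To make the two additions concrete, here is a minimal, self-contained sketch of a trie supporting insert-with-an-explicit-id and iterator-based longest-prefix search. It is illustrative only and does not mirror the PR's actual `Trie` class.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>

// Illustrative trie node; id == -1 means no key ends at this node.
struct TrieNode {
  std::unordered_map<char, std::unique_ptr<TrieNode>> children;
  int64_t id = -1;
};

struct Trie {
  TrieNode root;

  // Insert the key [begin, end) with a caller-chosen id
  // (e.g. to match an existing SPM vocabulary).
  template <typename It>
  void insert(It begin, It end, int64_t id) {
    TrieNode* node = &root;
    for (It it = begin; it != end; ++it) {
      auto& child = node->children[*it];
      if (!child) child = std::make_unique<TrieNode>();
      node = child.get();
    }
    node->id = id;
  }

  // Walk along [begin, end) and return the id of the longest inserted
  // key that is a prefix of the query, or -1 if none is.
  template <typename It>
  int64_t search_longest_prefix(It begin, It end) const {
    const TrieNode* node = &root;
    int64_t best = node->id;
    for (It it = begin; it != end; ++it) {
      auto child = node->children.find(*it);
      if (child == node->children.end()) break;
      node = child->second.get();
      if (node->id >= 0) best = node->id;
    }
    return best;
  }
};
```

Longest-prefix search is the natural primitive for greedily splitting input into the longest known vocabulary pieces, which is presumably why the trie is such a good fit here.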
## BPE
`BPETokenizer::tokenize` is probably the most interesting function. It is not the prettiest implementation, but it is pretty fast and beats SPM on my laptop. There is possible room for improvement in lines 135-160, where we search for neighbors with a linear scan (see the sketch below).
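For context, a greedy BPE merge loop typically has the following shape. This is a hedged sketch with assumed names and encoding (left id -> right id -> (rank, merged id)), not the PR's code; the linear scan for the lowest-rank pair is the kind of neighbor search referred to above.

```cpp
#include <cstdint>
#include <limits>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumed merge table: left id -> right id -> (rank, merged id).
using Merges = std::unordered_map<
    int64_t, std::unordered_map<int64_t, std::pair<int64_t, int64_t>>>;

// Repeatedly apply the lowest-rank (earliest learned) merge until none apply.
// prev/next emulate a doubly linked list so each merge is O(1) once found.
std::vector<int64_t> bpe_merge(std::vector<int64_t> symbols,
                               const Merges& merges) {
  int n = static_cast<int>(symbols.size());
  std::vector<int> prev(n), next(n);
  for (int i = 0; i < n; i++) {
    prev[i] = i - 1;
    next[i] = i + 1;
  }

  while (true) {
    // Linear scan over live neighbor pairs for the best (lowest-rank) merge.
    int best = -1;
    int64_t best_rank = std::numeric_limits<int64_t>::max();
    for (int i = 0; i < n; i = next[i]) {
      int j = next[i];
      if (j >= n) break;
      auto it = merges.find(symbols[i]);
      if (it == merges.end()) continue;
      auto jt = it->second.find(symbols[j]);
      if (jt == it->second.end()) continue;
      if (jt->second.first < best_rank) {
        best_rank = jt->second.first;
        best = i;
      }
    }
    if (best < 0) break;  // no applicable merge left

    // Replace the pair with the merged symbol and unlink the right node.
    int right = next[best];
    symbols[best] = merges.at(symbols[best]).at(symbols[right]).second;
    next[best] = next[right];
    if (next[right] < n) prev[next[right]] = best;
  }

  // Collect the surviving symbols in order.
  std::vector<int64_t> out;
  for (int i = 0; i < n; i = next[i]) out.push_back(symbols[i]);
  return out;
}
```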
`read_bpe_from_spm` ironically implements a small BPE in Python to extract the merges from the file.
## TL;DR
This PR implements SPM tokenization, so far with results exactly identical to spm or HF.