joshcarp / llm.go


BPE tokenizer Implementation #2

Open albertpurnama opened 2 months ago

albertpurnama commented 2 months ago

Hi,

I'm interested in contributing to implementing the BPE tokenizer.

Since we're using the GPT-2 encoding (as shown in the preprocessors), I think we can use the original tiktoken-go implementation. But this means that we can't yet train our own tokenizer (to produce our own BPE ranks).
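For reference, a minimal sketch of what wiring the encoder through tiktoken-go could look like. This assumes the pkoukk/tiktoken-go package and its GetEncoding/Encode API, and that the r50k_base encoding (which matches the GPT-2 vocabulary) is the one we want; treat it as a sketch rather than a final design:

```go
package main

import (
	"fmt"

	"github.com/pkoukk/tiktoken-go" // assumed dependency, not currently in go.mod
)

func main() {
	// r50k_base is tiktoken's name for the GPT-2 BPE ranks (assumption).
	enc, err := tiktoken.GetEncoding("r50k_base")
	if err != nil {
		panic(err)
	}
	// Encode takes the text plus allowed/disallowed special-token lists.
	ids := enc.Encode("hello world", nil, nil)
	fmt.Println(ids)
}
```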

This might be important if people want to experiment with changing the vocabulary size and "train" their own tokenizer. However, that might also be out of scope for this repo (or for llm.c).

Maybe to start, can you elaborate on why the existing trie encoding is incorrect? Is the merging of tokens not done properly according to the original GPT-2 tokenizer spec? Or is it a performance issue with the trie structure in general?

joshcarp commented 1 month ago

Hey, that would be awesome if you could contribute that

joshcarp commented 1 month ago

The reason the current implementation is incorrect is that it doesn't respect merge order. In BPE, tokenization should be done in merge order, which corresponds to the most-to-least frequent merge pairs in the training data.

For example, take the BPE encoding of the string "00000000000000000" (0*17):

The final tokenization is [8269, 10535, 830], corresponding to the strings ['00000000', '000000', '000'].
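For illustration, here's a minimal sketch of merge-order encoding. The `ranks` table and function name are hypothetical (the real GPT-2 ranks come from the merges file), and it merges one pair occurrence at a time, whereas the reference GPT-2 implementation merges all occurrences of the winning pair each round; the ordering principle is the same. With the real GPT-2 rank table, merge-order encoding is what produces the [8269, 10535, 830] split above.

```go
package main

import "fmt"

// bpeEncode repeatedly merges the adjacent pair with the lowest (i.e. most
// frequent) merge rank, which is the property the current trie ignores.
func bpeEncode(text string, ranks map[string]int) []string {
	// Start with one part per byte.
	parts := make([]string, 0, len(text))
	for i := 0; i < len(text); i++ {
		parts = append(parts, text[i:i+1])
	}
	for len(parts) > 1 {
		// Find the adjacent pair with the best (lowest) rank.
		best, bestRank := -1, 0
		for i := 0; i < len(parts)-1; i++ {
			if r, ok := ranks[parts[i]+parts[i+1]]; ok && (best == -1 || r < bestRank) {
				best, bestRank = i, r
			}
		}
		if best == -1 {
			break // no applicable merge left
		}
		parts[best] += parts[best+1]
		parts = append(parts[:best+1], parts[best+2:]...)
	}
	return parts
}

func main() {
	// Placeholder ranks, NOT the real GPT-2 merge table; the exact split
	// depends entirely on the rank order learned during training.
	ranks := map[string]int{"00": 0, "0000": 1, "00000000": 2, "000000": 3, "000": 4}
	fmt.Println(bpeEncode("00000000000000000", ranks)) // 17 zeros
}
```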

The trie implementation only looks for the longest matching prefix, so it would go down the path of 0000000000000000 (0*16), find that there is no longer match, and return 0000000000000000 as the first token and 0 as the second: tokens [25645, 15], strings ['0000000000000000', '0'].
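By contrast, a companion sketch of that greedy longest-prefix behaviour (illustrative names, not the repo's actual trie code; it can sit next to the bpeEncode sketch above):

```go
// greedyLongestPrefix grabs the longest known vocabulary entry at each
// position, which is why 0*17 comes out as ['0000000000000000', '0']
// instead of the merge-order result.
func greedyLongestPrefix(text string, vocab map[string]bool) []string {
	var out []string
	for len(text) > 0 {
		n := len(text)
		for n > 1 && !vocab[text[:n]] {
			n-- // shrink the candidate until it is a known token (or a single byte)
		}
		out = append(out, text[:n])
		text = text[n:]
	}
	return out
}
```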

The difference between the two approaches basically comes down to this: BPE follows the learned merge order, while the trie greedily takes the longest matching prefix, so they can produce different token splits for the same string.

Hk669 commented 3 weeks ago

@joshcarp what do you think about this implementation, Bpetokenizer? Could I contribute it here, or is this issue already addressed? Thanks.

joshcarp commented 3 weeks ago

Originally I wanted this to be a no-dependency project, used as an example of every moving part of the transformer without any external dependencies, but I've been way too distracted with other things to come back to it.

If you want to raise a PR that removes all of the current tokenizer stuff and puts BPEtokenizer in, that would be great, especially just to get the repo into a good working state.

I'm still going to do some BPE implementation in this repo for a planned blog post, similar to the Annotated Transformer, but it doesn't need to be fully feature-complete.

Hk669 commented 3 weeks ago

That's great @joshcarp, let me raise a PR with the BPETokenizer (I'm thinking of using the same GPT-4 pattern to split the tokens, in case anyone wants to train their own tokenizer). Thoughts on this?
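On the split pattern: both the GPT-2 and GPT-4 regexes end with a negative lookahead (`\s+(?!\S)`), which Go's standard regexp (RE2) doesn't support, so a Go port would need something like dlclark/regexp2 or a slightly adjusted pattern. Here's a rough sketch using the GPT-2 pattern (the GPT-4 / cl100k_base one differs mainly in case-insensitive contractions and splitting numbers into chunks of at most three digits); the regexp2 dependency is an assumption, and the pattern should be double-checked against the tiktoken source:

```go
package main

import (
	"fmt"

	"github.com/dlclark/regexp2" // assumed dependency: supports lookahead, unlike stdlib regexp
)

// GPT-2's pre-tokenization split pattern (from the original encoder);
// verify against the tiktoken/GPT-2 source before relying on it.
const gpt2Pattern = `'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+`

// splitChunks applies the pattern repeatedly, returning the text chunks
// that BPE is then run on independently.
func splitChunks(text string) []string {
	re := regexp2.MustCompile(gpt2Pattern, regexp2.None)
	var chunks []string
	m, _ := re.FindStringMatch(text)
	for m != nil {
		chunks = append(chunks, m.String())
		m, _ = re.FindNextMatch(m)
	}
	return chunks
}

func main() {
	fmt.Printf("%q\n", splitChunks("Hello world, it's 2024!"))
}
```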