karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
MIT License

Much faster regex tokenization using C++ and ctypes #65

Open · JohannesVod opened 5 months ago

JohannesVod commented 5 months ago

I implemented a much faster training/tokenization algorithm in C++ and call the functions through ctypes, so they can be used conveniently from Python. The performance gain is huge: I was able to tokenize ~3.5 GB of text in about 8 minutes. The main bottleneck is now the regex splitting, which is hard to optimize since I decided to keep it in Python (so that it is still easy to change the split pattern).

The training algorithm I used is from https://arxiv.org/abs/2306.16837, which states a running time of O(n log m). I think this is an overestimate; the true running time is O(n + m log m), which is linear in the sequence length in practice. Training took about 2 minutes on ~100 MB of text, which seems decent, though there is probably still a lot of room for improvement. Also, the encode function is much slower than encode_ordinary when special tokens are distributed evenly throughout the text, because of the extra splitting around them; this still needs to be fixed.
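To make the setup concrete, here is a minimal sketch of the architecture described above: the regex split stays in Python (using the same GPT-4 split pattern that minbpe's RegexTokenizer uses) and the per-chunk merge loop is delegated to a compiled C++ shared library through ctypes. The library name `libfastbpe.so`, the exported function `bpe_encode_chunk`, and its signature are illustrative placeholders, not the actual FastBPE API.

```python
import ctypes
import regex as re  # the regex module supports the possessive quantifiers in the pattern

# GPT-4 style split pattern, as used by minbpe's RegexTokenizer
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

# Hypothetical C++ shared library exposing one tokenization entry point.
lib = ctypes.CDLL("./libfastbpe.so")
lib.bpe_encode_chunk.restype = ctypes.c_int          # returns the number of tokens written
lib.bpe_encode_chunk.argtypes = [
    ctypes.c_char_p,                                 # raw UTF-8 bytes of one chunk
    ctypes.c_int,                                    # length of the chunk in bytes
    ctypes.POINTER(ctypes.c_int),                    # output buffer for token ids
]

def encode_ordinary(text: str) -> list[int]:
    """Split with the Python regex, then hand each chunk to the C++ merge loop."""
    ids = []
    for chunk in re.findall(GPT4_SPLIT_PATTERN, text):
        data = chunk.encode("utf-8")
        out = (ctypes.c_int * len(data))()           # worst case: one token per byte
        n = lib.bpe_encode_chunk(data, len(data), out)
        ids.extend(out[:n])
    return ids
```

Keeping the split on the Python side means changing the pattern is a one-line edit, at the cost of the regex pass itself remaining the bottleneck mentioned above.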

karpathy commented 5 months ago

Hi, this looks awesome. Maybe make a separate repo, and I'm very happy to link to it from this codebase in the main README under extensions?

JohannesVod commented 5 months ago

@karpathy Alright, you can link this one if you want to: https://github.com/JohannesVod/FastBPE