karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
MIT License
8.72k stars 794 forks source link

Faster BPE #5

Open zouharvi opened 4 months ago

zouharvi commented 4 months ago

Thanks for the nice repo! 🙂

Not sure if it's welcome here given that the goal of this repo is to be for educational purposes but you call the get_stats function, which is computationally heavy, V times, yielding O(|S| V) where V is the vocab size and |S| is the text size.

Huggingface tokenizers (tokenizers-rs) and other libraries use a live data structure (max heap with pointers) that makes it possible to not recompute the stats after each merge. The improved runtime is O(|S| log V).

It is described in Algorithm 2 here: https://aclanthology.org/2023.findings-acl.38.pdf

Here are the two algorithms: Screenshot_20240217-105550~2 Screenshot_20240217-105614~2

Btw it also has a mini version of BPE if anyone is interested in code-golfing BPE (would be cool to see the limits and hacks, I'm quite bad at it): Screenshot_20240217-105759~2

karpathy commented 4 months ago

Thank you for the pointers!

So I think we should retain the nice and clean and minimal and inefficient version as an algorithmic (and unit test) reference, but in addition to that we could have a version that is optimized for speed (e.g. bpe_regex_fast.py), and then probably even further lower that, e.g. to C/Rust/other.

So I'm not opposed to optimizations, we'd just keep that functionality separate. I'll either get around to this at some point, or I'd welcome PRs.

karpathy commented 4 months ago

One additional model I'm thinking about is what I did with llama2.c, where instead of reviewing tons of PRs for the root repo, it is treated more as a reference and people fork it to their own languages/repos optimized for different purposes, and then I'm happy to link to it from the Readme here. This path is advisable for 1) larger re-writes, 2) where people want to iterate fast themselves and 3) don't want me personally to become a PR bottleneck.

99991 commented 4 months ago

I implemented heap-based token merging for my simplified TinyLlama implementation. Feel free to copy what you need. Here is the function:

https://github.com/99991/SimpleTinyLlama/blob/main/tokenizer.py#L70

This function might be easier to understand after reading the code of the slow implementation first. It can be found below this function in a commend:

https://github.com/99991/SimpleTinyLlama/blob/9af6f7df6e12d8478a90d3cd5c8e8c1a95fce0fe/tokenizer.py#L136

gautierdag commented 4 months ago

I've actually implemented the above optimizations in Rust: https://github.com/gautierdag/bpeasy/tree/main

It mostly follows what's done in the Huggingface's tokenizers library but in 400 lines, so might be an easier guide for anyone trying to port BPE optimizations.

paplorinc commented 4 months ago

I haven't seen the papers, seems I've reinvented the linearithmic byte pair merge which I've contributed back to tiktoken and jtokkit, see:

JohannesVod commented 4 months ago

I think i found a small bug in your algorithm. For example, if the input sequence is 1,2,1,2,1,2 then first the pair (1,2) is merged to lets say token 3. When merging into token 3 we actually call AddPosition(3,3) twice at two different positions, but it is only possible to substitute the pair (3,3) once. In a later iteration this will throw an error because we try to access a deleted node. (Depending on the implementation of AddPosition). Screenshot 2024-03-05 163536

JohannesVod commented 3 months ago

I used your algorithm and implemented it in c++/ctypes to speed up the Regex tokenizer by a lot. See https://github.com/NiceGuySaysHi/QuickBPE Right now it takes ~60 seconds on my 12th Gen Intel(R) Core(TM) i5-12400F 2.50 GHz to train on100mb of text. To tokenize 100mb it takes ~20 seconds, however most time is spend on the regex splitting, the c++ call itself only takes ~3 seconds

paplorinc commented 3 months ago

You can avoid the regex completely, see my parser for JTokkit: https://github.com/knuddelsgmbh/jtokkit/pull/77/files#diff-4836953d74c2e15ea9fb4c739ec54dfa0709682d04c237e46f7ca9af6630cdaa

JohannesVod commented 3 months ago

does this give better compression rates? @paplorinc

paplorinc commented 3 months ago

No, it's just magnitudes faster than the implementation here, but produces the exact same output as tiktoken

JohannesVod commented 3 months ago

Ah, i see. At some point i had a similar implementation as well. I think i dropped it because it was too memory intensive. I will take a look at it again, maybe i can improve it.