karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
MIT License

Steal token visualisation code #11

Status: Open · hauntsaninja opened this issue 4 months ago

hauntsaninja commented 4 months ago

Hello! I don't remember if I'd shown you https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py , but consider stealing the token visualisation code from here in some form: https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L186 I've found it can be quite useful to show people the intermediate steps :-) There's similar stuff in there for training
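
The core of it is small. As a rough sketch of the idea (not the actual `_educational.py` code, and assuming an ANSI-capable terminal), it's roughly:

```python
# Rough sketch of the idea: give each token's bytes a cycling ANSI background
# colour so the token boundaries are visible (not the actual _educational.py code).

def visualise_tokens(token_values: list[bytes]) -> None:
    backgrounds = [f"\x1b[48;5;{n}m" for n in (167, 179, 185, 77, 80, 68, 134)]
    reset = "\x1b[0m"
    pieces = []
    for i, token in enumerate(token_values):
        pieces.append(backgrounds[i % len(backgrounds)])
        pieces.append(token.decode("utf-8", errors="replace"))
    pieces.append(reset)
    print("".join(pieces))

# e.g. visualise_tokens([b"super", b"cal", b"if", b"rag", b"il", b"istic"])
```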

*(Screenshot: coloured token visualisation example, 2024-02-17)*
karpathy commented 4 months ago

Hey! Neat, I'll take a look :)

Btw I noticed that in this line:

https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L97

The way you're merging during inference is by checking the rank of the merged pair and merging if you find it. This is similar but not identical to checking for the actual pair match - i.e., that the two children are what they should be - and then merging up the tree if so. You're checking equality of the parent node instead of equality of the child nodes.

It's not obvious that these two should be equivalent.
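
Concretely, the two checks look something like this (just a sketch, not the actual tiktoken or minbpe code):

```python
# Illustrative sketch of the two inference-time merge rules being discussed
# (function names and signatures are mine, not from either repo).

def merge_by_parent_rank(parts: list[bytes], ranks: dict[bytes, int]) -> list[bytes]:
    # tiktoken-style: look up the rank of the *merged* bytes (the parent)
    # and repeatedly apply the lowest-ranked merge that exists.
    while True:
        best = None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best is None or rank < best[1]):
                best = (i, rank)
        if best is None:
            return parts
        i = best[0]
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]

def merge_by_child_pair(parts: list[bytes], merges: dict[tuple[bytes, bytes], int]) -> list[bytes]:
    # minbpe-style: look up the *pair of children* in the learned merges
    # and repeatedly apply the pair with the lowest merge priority.
    while True:
        best = None
        for i in range(len(parts) - 1):
            prio = merges.get((parts[i], parts[i + 1]))
            if prio is not None and (best is None or prio < best[1]):
                best = (i, prio)
        if best is None:
            return parts
        i = best[0]
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]
```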

hauntsaninja commented 4 months ago

Yup, that is indeed a subtle thing about how tiktoken does things. I have a comment about this in the Rust code, but I should add something to the simple code too: https://github.com/openai/tiktoken/blob/1b9faf2779855124f05174adf1383e53689ed94b/src/lib.rs#L23

It's equivalent if your token index equals merge priority. You make the same inductive argument that this code makes: https://github.com/karpathy/minbpe/blob/master/minbpe/gpt4.py#L27 The sketch is something like: prove BPE_i(bytes) = BPE_i'(bytes), where _i means running partial BPE up to merge priority i. The worry is something like T_4 := T_1 ++ T_2 == T_0 ++ T_3, where BPE_4' ends up merging T_0, T_3 (when BPE_4 only merges T_1, T_2). But from the inductive step, BPE_2(T_4) == BPE_2'(T_4) == [T_1, T_2], so you can't actually have those bytes represented as T_0, T_3 when we run BPE_4'.
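
Written out a little more formally (same notation as above; primes mark the "check the parent's rank" variant, and token index is assumed to equal merge priority):

```latex
\begin{aligned}
  &\forall i:\quad \mathrm{BPE}_i(\text{bytes}) = \mathrm{BPE}'_i(\text{bytes})
    && \text{(claim)} \\
  &T_4 := T_1 \mathbin{+\!\!+} T_2 = T_0 \mathbin{+\!\!+} T_3
    && \text{(the worrying case)} \\
  &\mathrm{BPE}_2(T_4) = \mathrm{BPE}'_2(T_4) = [\,T_1,\ T_2\,]
    && \text{(inductive hypothesis)}
\end{aligned}
```

So by the time merge 4 is considered, the bytes of T_4 are never sitting as the adjacent pair (T_0, T_3), and the two variants agree.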

Like the comment says, it's possible to mess with things and break the equivalence. The most interesting way, I think, in which you would get different results is if you adapted this to do stochastic BPE.

That said, there isn't a strong reason tiktoken does things this way — it was just a little simpler and there's some connection to stuff I was toying with when I first wrote it.

asingh228 commented 4 months ago

> Hello! I don't remember if I'd shown you https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py , but consider stealing the token visualisation code from here in some form: https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L186 I've found it can be quite useful to show people the intermediate steps :-) There's similar stuff in there for training

Hey hauntsaninja, I was able to make the change to visualize the steps. I added a `verbose` flag to turn the visualisation on and off: `encoded_text = tokenizer.encode(text, verbose=True)`
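
Roughly, the idea of the change looks like this (a hypothetical sketch following minbpe's `(int, int) -> int` merges convention, not the exact patch):

```python
# Hypothetical sketch of a verbose encode that prints each intermediate merge.
# `merges` maps (id, id) -> new id and `vocab` maps id -> bytes, as in minbpe;
# this is illustrative only, not the actual change in the fork.

def encode_verbose(text: str, merges: dict[tuple[int, int], int],
                   vocab: dict[int, bytes], verbose: bool = True) -> list[int]:
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        # find the adjacent pair with the lowest merge priority
        pairs = set(zip(ids, ids[1:]))
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # nothing left to merge
        idx = merges[pair]
        # apply the merge everywhere it occurs
        new_ids, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                new_ids.append(idx)
                i += 2
            else:
                new_ids.append(ids[i])
                i += 1
        ids = new_ids
        if verbose:
            print(f"merge {pair} -> {idx}:", [vocab[t] for t in ids])
    return ids
```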

Output Example

FYI: I'm not a professional developer (I have a CS degree and I'm learning AI/ML development via online lessons and ChatGPT; ChatGPT was used in modifying the code).

asingh228 commented 4 months ago

I would like your help in understanding two things, though:

  1. Why does this code behave so differently from `SimpleBytePairEncoding.from_tiktoken("cl100k_base").encode("supercalifragilistic")`, which was mentioned earlier? Is it because the training was done on Taylor Swift's wiki?
  2. (removed - was not a valid question)

thank you

*(Screenshot: tokenizer output example, 2024-02-19)*
hauntsaninja commented 4 months ago

Hello, glad you're having fun learning!

The output of your code kind of looks like you're training on just "supercalifragilistic", not encoding it. You'll want to train on a slightly larger corpus ;-) Once you've done that, use the trained encoding to encode single words like "supercalifragilistic".
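
For example, something along these lines (a sketch using minbpe's `BasicTokenizer`; the corpus file and vocab size here are placeholders, not anything prescribed):

```python
# Sketch of the train-then-encode workflow with minbpe.
from minbpe import BasicTokenizer

tokenizer = BasicTokenizer()
text = open("taylorswift.txt", encoding="utf-8").read()  # any reasonably sized corpus
tokenizer.train(text, vocab_size=512, verbose=True)      # 512 - 256 byte tokens = 256 merges

ids = tokenizer.encode("supercalifragilistic")
print(ids)
print(tokenizer.decode(ids))  # should round-trip back to the input string
```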

I'm not sure what you mean by "how many merges you decide to do" / where 3 merges comes from. BPE will merge until it can merge no more. (Unless you mean what vocabulary size to target during training, in which case the answer is "whatever makes the language model that uses this encoding good", which can be a little tricky to determine.)

There's a working version of the code in https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py ; feel free to explore it.