pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

Llama3 tokenizer missing token ID 128011 #1995

Closed: SalmanMohammadi closed this issue 1 week ago

SalmanMohammadi commented 1 week ago

I'm getting an error when decoding with the Llama3 tokenizer - we seem to be missing special token ID 128011. It's defined here for the Llama3.1 tokenizer and here for the Llama3.2 vision tokenizer, as reserved_special_token_3 and reserved_special_token_2 respectively, but when we set up the reserved tokens in our llama3 tokenizer, we skip 128011. Should we be adding this token ID to the reserved special tokens list?

from torchtune.models.llama3 import llama3_tokenizer

tk = llama3_tokenizer("/home/salman/Downloads/tokenizer.model")
tk.decode([128011], skip_special_tokens=True)

thread '<unnamed>' panicked at src/lib.rs:179:64:
no entry found for key
PanicException                            Traceback (most recent call last)
Cell In[1], line 4
      1 from torchtune.models.llama3 import llama3_tokenizer
      3 tk = llama3_tokenizer("/home/salman/Downloads/tokenizer.model")
----> 4 tk.decode([128011], skip_special_tokens=True)

File ~/torchtune/torchtune/models/llama3/_tokenizer.py:189, in Llama3Tokenizer.decode(self, token_ids, truncate_at_eos, skip_special_tokens)
    172 """
    173 Decode a list of token ids into a string.
    174 
   (...)
    183     str: The decoded string.
    184 """
    185 # We will remove special tokens manually via regex on the decoded string.
    186 # This is because removing all special tokens does not remove the role and
    187 # whitespace added from the special tokens, i.e., the "user" and "\n\n" in
    188 # "<|start_header_id|>user<|end_header_id|>\n\n"
--> 189 decoded_string = self.tt_model.decode(
    190     token_ids=token_ids,
    191     truncate_at_eos=truncate_at_eos,
    192 )
    193 return (
    194     self._remove_special_tokens(decoded_string)
    195     if skip_special_tokens
    196     else decoded_string
    197 )

File ~/torchtune/torchtune/modules/tokenizers/_tiktoken.py:160, in TikTokenBaseTokenizer.decode(self, token_ids, truncate_at_eos)
    158     if k:
    159         token_ids = token_ids[:k]
--> 160 return self.tt_model.decode(token_ids)

File ~/.pyenv/versions/3.11.9/envs/tune/lib/python3.11/site-packages/tiktoken/core.py:258, in Encoding.decode(self, tokens, errors)
    246 def decode(self, tokens: list[int], errors: str = "replace") -> str:
    247     """Decodes a list of tokens into a string.
    248 
    249     WARNING: the default behaviour of this function is lossy, since decoded bytes are not
   (...)
    256     ```
    257     """
--> 258     return self._core_bpe.decode_bytes(tokens).decode("utf-8", errors=errors)

PanicException: no entry found for key

This occurs for the Llama3, Llama3.1, and Llama3.2 tokenizers. cc @RdoubleA

vancoykendall commented 1 week ago

There seems to be something strange with the <|image|> token's token_id. Based on this repo and Hugging Face it should be 128256. However, 128011 would make more sense based on the tokenizer code both in this repo and in the https://github.com/meta-llama/llama-models repo, and that mismatch is what's causing the error here:

https://github.com/pytorch/torchtune/blob/4b6877a6ef31a1f987c27594eaf8fe467b5ab785/torchtune/models/llama3/_tokenizer.py#L28-L38
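
For readers without the source open, the layout being described is roughly the following (an illustrative sketch of the special-token table, not the actual torchtune code at the link above; names, counts, and offsets are approximations):

# Illustrative sketch only -- not the real torchtune definitions.
NAMED_SPECIAL_TOKENS = {
    "<|begin_of_text|>": 128000,
    "<|end_of_text|>": 128001,
    # ... other named special tokens ...
    "<|python_tag|>": 128010,
    "<|image|>": 128256,
}

# The auto-generated reserved tokens pick up at 128012, so no name is ever
# registered for id 128011.
RESERVED_TOKENS = {
    f"<|reserved_special_token_{i}|>": 128012 + i for i in range(244)
}

all_ids = set(NAMED_SPECIAL_TOKENS.values()) | set(RESERVED_TOKENS.values())
assert 128011 not in all_ids  # the gap that triggers the decode failure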

I raised a related issue here: https://github.com/meta-llama/llama-models/issues/219

RdoubleA commented 1 week ago

The token ID for the image token should be 128256 in torchtune, as that is what Hugging Face uses, and HF is where our users download the models from. 128256 was the token ID used when initially training the model on images, but for inference the embedding was moved to 128011 (which was previously a reserved token) so users wouldn't have to keep the embedding vectors for both 128011 and 128256, saving compute/memory at inference time.
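
As a rough sketch of what that inference-time remapping amounts to (the sizes, IDs as variable names, and the from_pretrained step below are illustrative assumptions, not code from torchtune or llama-models):

import torch

# Hypothetical sizes: 128256 text rows plus one extra trained row for <|image|>.
vocab_size, dim = 128256, 64  # small dim just for illustration
embedding = torch.nn.Embedding(vocab_size + 1, dim)

IMAGE_ID_TRAINING = 128256   # id used during training and by the HF checkpoints
IMAGE_ID_INFERENCE = 128011  # previously-reserved id reused at inference time

with torch.no_grad():
    weight = embedding.weight.clone()
    # Reuse the reserved slot for the trained <|image|> embedding, then drop the
    # extra row so only one copy of the vector is kept at inference time.
    weight[IMAGE_ID_INFERENCE] = weight[IMAGE_ID_TRAINING]
    inference_embedding = torch.nn.Embedding.from_pretrained(weight[:vocab_size])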

This may explain why you see 128011 being used when you download directly from Meta using llama model download, @vancoykendall, since the llama repos are optimized for inference rather than post-training. But if you run fine-tuning in torchtune with the model downloaded from HF, you should use 128256, which is what our tokenizers expect.

@SalmanMohammadi, since you are explicitly calling decode on 128011, you are seeing this error. But do you encounter this token ID in a real use case? I would imagine the tokenizer would never output 128011, since it is not used for anything, so you would never have to decode it. Still, even if a stray 128011 did show up, I would expect you to get a random embedding vector, since the embedding table is contiguous. Perhaps tiktoken prevents this when the token ID is not registered as a special token and is not in the regular vocab.
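
For what it's worth, tiktoken does fail on IDs that are registered neither in the BPE ranks nor as special tokens - that is exactly the panic in the traceback above. A quick standalone way to see the same class of failure (not torchtune code; the exact exception type varies by tiktoken version - older releases panic, as in the traceback):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# The highest registered id decodes fine (it is a special token here).
print(enc.decode([enc.n_vocab - 1]))

# An id with no entry in either table fails inside the Rust core; depending on
# the tiktoken version this surfaces as a PanicException or a KeyError.
try:
    enc.decode([enc.n_vocab])
except BaseException as err:  # PanicException derives from BaseException
    print(type(err).__name__, err)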

It may be that our reserved token logic is incorrect and should create a reserved token for 128011. I'll take a closer look at this and at the HF tokenizer config to see if that's the case.

Tagging some folks who have more knowledge on this topic to clarify anything I may have missed, @pbontrager @abhimanyudubey

SalmanMohammadi commented 1 week ago

But do you find this token id in a real use case?

If you set limit=1 on any of the vision eleuther eval recipe tests, it'll come up - that's how I found it.

SalmanMohammadi commented 1 week ago

FWIW the HF config for the 3.2 vision tokenizer reserves token 128011 as a special token.
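
One way to check this locally is to inspect the tokenizer_config.json shipped with the HF checkpoint (a sketch; it assumes a local copy of that file from the 3.2 vision repo, and the expectations in the comments are based on the observation above rather than quotes from the file):

import json

# Every added special token is listed under "added_tokens_decoder", keyed by
# its token id as a string.
with open("tokenizer_config.json") as f:
    config = json.load(f)

added = config["added_tokens_decoder"]
print(added.get("128011"))  # expected: a reserved special token entry
print(added.get("128256"))  # expected: the <|image|> token entry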

vancoykendall commented 1 week ago

@RdoubleA The problem with 128011 being used in the llama model download is that its embedding doesn't match the 128256 token embedding in the HF model, so it seems the Meta checkpoint never actually changed the 128011 token embedding. I downloaded the Llama 3.2 Vision 11B checkpoint from Meta last night and checked it here:

[screenshot: comparison of the Meta checkpoint's 128011 embedding and the HF model's 128256 embedding]
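
A check along those lines might look roughly like this (a sketch only; the filenames and state-dict keys are assumptions and will likely need adjusting to the actual layout of the Meta and HF downloads):

import torch

# Hypothetical paths/keys: a Meta consolidated checkpoint and a tensor file
# holding the HF model's input embedding table.
meta_sd = torch.load("meta/consolidated.pth", map_location="cpu")
hf_embed = torch.load("hf/embed_tokens.pt", map_location="cpu")

meta_128011 = meta_sd["tok_embeddings.weight"][128011].float()
hf_128256 = hf_embed[128256].float()

# If the reserved slot had really been overwritten with the trained <|image|>
# embedding, these rows would match.
print(torch.allclose(meta_128011, hf_128256, atol=1e-5))
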
RdoubleA commented 1 week ago

Confirmed with @ashwinb on the llama-models repo that 128256 is the correct ID and 128011 should not be used, for the reasons explained in the linked issue. I am not sure what HF is doing on their end, but 128011 should not be a reserved token, nor should it ever come up in the wild, as its embedding is not trained. So @SalmanMohammadi's original error is indeed expected behavior. Why the eleuther eval recipe test produced a 128011 when using our own tokenizer is a separate question and needs to be debugged.

To provide clarity to future users who may have similar questions, I can add some comments in the code to explain why 128011 is skipped.

SalmanMohammadi commented 1 week ago

Thanks @vancoykendall for helping look into this and @RdoubleA for the clear explanation. I think this occurred because some of our recipe tests use an untrained model with dummy weights, which generates this token. Closing this for now, as that is a separate issue.