meta-llama / llama-models

Utilities intended for use with Llama models.

Incorrect token id for `<|image|>` token #219

Open vancoykendall opened 1 week ago

vancoykendall commented 1 week ago

In this repo, the Llama3 tokenizer sets the `<|image|>` special token to 128011: https://github.com/meta-llama/llama-models/blob/ec6b56330258f6c544a6ca95c52a2aee09d8e3ca/models/llama3/api/tokenizer.py#L79-L101

However, in the `tokenizer_config.json` uploaded to the Hugging Face repo meta-llama/Llama-3.2-11B-Vision-Instruct, the `<|image|>` token is mapped to 128256.
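
That mapping can be checked programmatically. A minimal sketch; since the real `tokenizer_config.json` sits behind gated access, the JSON fragment below is a simulated stand-in shaped like its `added_tokens_decoder` section:

```python
import json

# Simulated fragment shaped like the `added_tokens_decoder` section of the
# HF repo's tokenizer_config.json (the real file is behind gated access).
tokenizer_config = json.loads("""
{
  "added_tokens_decoder": {
    "128256": {"content": "<|image|>", "special": true}
  }
}
""")

# Recover the id(s) the HF tokenizer assigns to <|image|>.
image_ids = [
    int(tid)
    for tid, tok in tokenizer_config["added_tokens_decoder"].items()
    if tok["content"] == "<|image|>"
]
# image_ids == [128256], not the 128011 used by the reference tokenizer.
```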


I also checked the norms of the model's embedding layer for tokens 128011 and 128256. Token 128011 has a norm near zero, while token 128256 has a regular norm. This makes me think 128256 is the correct embedding index for the `<|image|>` token.
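
The norm check described above can be sketched like this. Because the 11B checkpoint is gated, the embedding matrix here is a synthetic stand-in (a zeroed row at 128011 mimicking an untrained slot) rather than the real `model.get_input_embeddings().weight`:

```python
import torch

def embedding_norms(weight: torch.Tensor, token_ids: list[int]) -> dict[int, float]:
    """L2 norm of each requested row of an embedding matrix."""
    return {tid: weight[tid].norm().item() for tid in token_ids}

# Synthetic stand-in for model.get_input_embeddings().weight:
# 128257 rows so both ids exist; row 128011 zeroed to mimic an
# untrained slot, the rest drawn from a normal init.
weight = torch.randn(128257, 32)
weight[128011] = 0.0

norms = embedding_norms(weight, [128011, 128256])
# The zeroed (untrained-looking) row has norm 0; the other row does not.
```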

vancoykendall commented 1 week ago

Also if you download the weights from meta

llama model download --source meta --model-id Llama3.2-11B-Vision-Instruct

There are no embeddings for ids 128256 and up, so it looks like the image token embedding is just missing altogether.
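
A quick way to confirm the missing row is to compare the embedding table's first dimension against the token id. A sketch under the assumption that the Meta-format checkpoint's table has exactly 128256 rows (the tensor below is a synthetic stand-in, not the real `tok_embeddings.weight`):

```python
import torch

def has_row_for(embed_weight: torch.Tensor, token_id: int) -> bool:
    """True if the embedding table contains a row for token_id."""
    return token_id < embed_weight.shape[0]

# Synthetic stand-in with exactly 128256 rows (ids 0..128255),
# matching what the Meta download appears to contain.
weight = torch.zeros(128256, 8)
assert has_row_for(weight, 128011)       # the tokenizer's <|image|> id exists
assert not has_row_for(weight, 128256)   # the HF <|image|> id has no row
```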

I also checked the norms of the token embeddings from the Meta checkpoint to confirm that 128011 is an untrained embedding vector.

ashwinb commented 1 week ago

@vancoykendall you have identified a most terrible wart in the llama3 vision model.

See https://github.com/meta-llama/llama-models/blob/main/models/llama3/api/chat_format.py#L226

Essentially, the `<|image|>` token does correspond to 128011 in the tokenizer ... however, the special token that actually got trained is the last token, 128256. The reason that happened is a very esoteric one rooted in our training process.

cc @abhimanyudubey

vancoykendall commented 1 week ago

@ashwinb Gotcha, but shouldn't the checkpoint downloaded from Meta contain the token embedding for the `<|image|>` token? The HF checkpoint has it at embedding index 128256, but from what I can tell the Meta checkpoint just doesn't have it.