vancoykendall opened 1 week ago
Also, if you download the weights from Meta with

```
llama model download --source meta --model-id Llama3.2-11B-Vision-Instruct
```

there are no embeddings for index 128256 and up, so it looks like the image token embedding is just missing altogether.
I also checked the norms of the embedding tokens from the meta checkpoint here to confirm 128011 is an untrained embedding vector:
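A toy version of that norm check (a sketch with made-up sizes, not the real checkpoint, which would need `torch.load` on the downloaded weights) might look like:

```python
import numpy as np

# Toy sketch of the norm check: a row that never moved off its
# (near-)zero init stands out when you compare per-token embedding norms.
rng = np.random.default_rng(0)
vocab_size, dim = 16, 8              # stand-ins; the real vocab is 128256
emb = rng.normal(0.0, 0.02, size=(vocab_size, dim))
emb[11] = 0.0                        # row 11 plays the role of token 128011

norms = np.linalg.norm(emb, axis=1)  # one L2 norm per token row
suspect = int(np.argmin(norms))
print(suspect, float(norms[suspect]))  # row 11, norm 0.0
```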
@vancoykendall you have identified a most terrible wart in the llama3 vision model.
See https://github.com/meta-llama/llama-models/blob/main/models/llama3/api/chat_format.py#L226
Essentially the <|image|> token does correspond to 128011 in the tokenizer ... however, the special token that actually got trained is the last token, 128256. Why that happened comes down to a very esoteric quirk of our training process.
cc @abhimanyudubey
@ashwinb Gotcha, but shouldn't the downloaded checkpoint from Meta contain the token embedding for the <|image|> token? The HF checkpoint has it at embedding index 128256, but from what I can tell the Meta download checkpoint just doesn't have it at all.
In this repo the Llama3 tokenizer sets the `<|image|>` special token to 128011:
https://github.com/meta-llama/llama-models/blob/ec6b56330258f6c544a6ca95c52a2aee09d8e3ca/models/llama3/api/tokenizer.py#L79-L101

However, in the tokenizer_config.json uploaded to the Hugging Face repo meta-llama/Llama-3.2-11B-Vision-Instruct, the `<|image|>` token is mapped to 128256.

I also checked the norms of the model's embedding layer for tokens 128011 and 128256. Token 128011 has a norm near zero, while token 128256 has a regular norm. This makes me think 128256 is the correct token embedding for the `<|image|>` token.
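If that diagnosis is right, one hedged workaround (just a sketch, not a fix shipped by either repo) is to remap the tokenizer's id to the trained row before the embedding lookup; `remap_image_token` here is a hypothetical helper:

```python
# Assumption from the discussion above: the Meta tokenizer emits 128011
# for <|image|>, but the trained embedding row is 128256 (the HF layout).
TOKENIZER_IMAGE_ID = 128011   # id the tokenizer assigns to <|image|>
TRAINED_IMAGE_ID = 128256     # row that actually holds a trained embedding

def remap_image_token(token_ids):
    # Replace every occurrence of the untrained image id with the
    # trained one; all other token ids pass through unchanged.
    return [TRAINED_IMAGE_ID if t == TOKENIZER_IMAGE_ID else t
            for t in token_ids]

print(remap_image_token([9906, 128011, 1917]))  # [9906, 128256, 1917]
```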