Llama 3.2 - GitHub Issues

yieldthought commented 1 week ago

Bring up Llama 3.2 model family on Wormhole, T3K and TG

cglagovichTT commented 1 week ago

10/2 update:

Padding heads to 32 was broken. The logic did not pad one Q head per KV group, but instead added Q heads to the last groups, leading to PCC issues. I tried to figure out how to do this correctly but couldn't (GPT couldn't either :) ).
We can get away without padding heads at all. Prefill worked as-is, and decode required a small change to flash_decode to support the unpadded head count.
Tested increasing the precision of RMSNorm to bf16. PCC on the decoder test dropped slightly, from 0.9999955311379768 to 0.9999952197607859. We might want to take another look at this with real inputs in a full model test, since I suspect that bf16 weights for RMSNorm are necessary.
Removed a tricky deallocate from LlamaAttention which caused low PCC.
Used FP32 accumulation in the MLP and returned bf16 from FF2. This gave a minor PCC boost and might avoid problems later.
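
For anyone following along, here is a rough sketch of why the placement of padded Q heads matters under grouped-query attention. It is not the tt-metal code. It assumes a model with 24 Q heads and 8 KV heads padded to 32 Q heads, and it assumes Q head i reads KV head i // group_size. The helper names (pad_per_group, pad_at_end, misrouted) are made up for illustration, and pad_at_end is a simplified stand-in for the broken behaviour described above.

```python
def pad_per_group(heads, n_kv_heads, padded_n_heads, pad=None):
    """Insert padding at the end of each KV group so every group grows equally."""
    group = len(heads) // n_kv_heads
    padded_group = padded_n_heads // n_kv_heads
    out = []
    for g in range(n_kv_heads):
        out.extend(heads[g * group:(g + 1) * group])
        out.extend([pad] * (padded_group - group))
    return out


def pad_at_end(heads, padded_n_heads, pad=None):
    """Simplified stand-in for the broken variant: all padding lands at the tail."""
    return list(heads) + [pad] * (padded_n_heads - len(heads))


def misrouted(layout, orig_group, padded_group):
    """Real heads whose KV group after padding differs from their original one."""
    return [
        head
        for slot, head in enumerate(layout)
        if head is not None and head // orig_group != slot // padded_group
    ]


# Assumed example shape: 24 Q heads, 8 KV heads, padded to 32 Q heads.
q_heads = list(range(24))
good = pad_per_group(q_heads, n_kv_heads=8, padded_n_heads=32)
bad = pad_at_end(q_heads, padded_n_heads=32)

print("per-group padding, misrouted heads:", misrouted(good, 3, 4))
print("tail padding, misrouted heads:", misrouted(bad, 3, 4))
```

With per-group padding every real head keeps its KV group. With tail padding some heads end up attending against the wrong KV head, which is the kind of thing that shows up as a PCC drop.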

What's next:

cglagovichTT commented 1 week ago

Llama3.2-11B-Vision bringup

Text model

Vision model

To run new tests, I need to figure out how to share llama-models changes. You also have to install some new packages.

pip install -r ../llama-models/requirements.txt

No issues

Has bias, uses GELU as activation. Only two linears.
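
A minimal sketch of the structure described above: two linear layers with bias and a GELU in between. This is plain Python for illustration, not the reference or tt-metal implementation. The names vision_mlp, w1, b1, w2, b2 are hypothetical, and the exact (erf-based) GELU is an assumption about which variant is used.

```python
import math


def gelu(x):
    # Exact erf-based GELU; the variant used by the real model is an assumption here.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    # weight is [out_features][in_features], bias is [out_features].
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def vision_mlp(x, w1, b1, w2, b2):
    # linear -> GELU -> linear, both linears with bias, no gating branch.
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


# Tiny made-up weights just to exercise the function.
out = vision_mlp(
    [1.0, 2.0],
    w1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
    w2=[[1.0, 1.0]], b2=[0.5],
)
print(out)
```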

Very similar to Attention, but does not generate a cache! It's MHA. Not a great shape, though: ImageAttention has dim=1280, head_dim=80, n_heads=16. It also requires an attention mask, which means we need to support non-causal attention in SDPA. Meta does something strange with qkvo replication which I don't understand: https://github.com/meta-llama/llama-models/blob/main/models/llama3/reference_impl/multimodal/model.py#L254
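
To make the mask requirement concrete, here is a small single-head sketch of non-causal attention with an additive mask (0 keeps a key, -inf drops it). It is plain Python for illustration only, not the SDPA op. masked_attention is a hypothetical name, and the code assumes every query row has at least one unmasked key.

```python
import math

# Shape check for the numbers quoted above: 16 heads of width 80 fill dim 1280.
DIM, HEAD_DIM, N_HEADS = 1280, 80, 16
assert N_HEADS * HEAD_DIM == DIM


def masked_attention(q, k, v, mask):
    """q: [Lq][d], k and v: [Lk][d], mask: [Lq][Lk] additive (0 or -inf)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        scores = [
            scale * sum(a * b for a, b in zip(qi, kj)) + mask[i][j]
            for j, kj in enumerate(k)
        ]
        top = max(scores)  # assumes at least one unmasked key per row
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append(
            [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out


NEG_INF = float("-inf")
q = k = v = [[1.0, 0.0], [0.0, 1.0]]

# No causal structure: query 0 is free to attend to the later key 1.
full = masked_attention(q, k, v, [[0.0, 0.0], [0.0, 0.0]])
# Padding-style mask: key 1 is hidden from every query.
padded = masked_attention(q, k, v, [[0.0, NEG_INF], [0.0, NEG_INF]])
print(full)
print(padded)
```

The point is that the mask is arbitrary rather than lower-triangular, so a causal-only SDPA path cannot be reused as-is.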

tenstorrent / tt-metal