On recent commits we've noticed severe accuracy degradation on Mixtral. This can be observed both in the demo run (garbage output) and in test_mixtral_decoder.py finishing with bad PCC.
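For context, PCC here is the Pearson correlation coefficient between the reference model's output and the device output. The test harness has its own comparison helper; the following is only a minimal pure-Python sketch of the metric itself, with hypothetical names:

```python
import math

def pcc(golden, actual):
    """Pearson correlation coefficient between two flattened output sequences.

    Returns 1.0 for perfectly correlated outputs; values well below the
    test threshold indicate the kind of degradation described above.
    """
    n = len(golden)
    mean_g = sum(golden) / n
    mean_a = sum(actual) / n
    cov = sum((g - mean_g) * (a - mean_a) for g, a in zip(golden, actual))
    var_g = sum((g - mean_g) ** 2 for g in golden)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_g * var_a)

# Identical outputs correlate perfectly; anti-correlated outputs give -1.
print(pcc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # → 1.0
```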
At the moment we can safely confirm that the issue is not tied to the cached weights, as we've re-generated both the general and the instruct weights and confirmed passing on older commits.
To successfully close this issue we will also add a token comparison to our demo code: after the Mixtral demo finishes, the outputs of its 32 users will be compared against expected outputs.
This will serve as an interim accuracy check until we add a proper perplexity score or top-1/top-5 accuracy check.
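The proposed token comparison could be sketched as follows; the function and variable names (`compare_demo_outputs`, the 32-user batch layout) are assumptions for illustration, not the actual demo API:

```python
def compare_demo_outputs(generated_outputs, expected_outputs):
    """Compare each user's generated token sequence to its expected sequence.

    Both arguments are lists (one entry per user, e.g. 32 entries) of token
    sequences. Returns the list of user indices whose output mismatched.
    """
    assert len(generated_outputs) == len(expected_outputs)
    failures = []
    for user_id, (got, expected) in enumerate(zip(generated_outputs, expected_outputs)):
        if got != expected:
            failures.append(user_id)
    return failures

# Example: 32 users where the last one produced garbage output.
generated = [["Hello", "world"]] * 31 + [["garbage"]]
expected = [["Hello", "world"]] * 32
print(compare_demo_outputs(generated, expected))  # → [31]
```

A demo-level check like this catches end-to-end regressions cheaply, at the cost of being brittle to benign sampling changes, hence the note that it is interim until a perplexity or top-1/top-5 check lands.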