On recent commits we've noticed severe accuracy degradation on Mixtral. This can be observed both in the demo run (garbage output) and in test_mixtral_decoder.py finishing with bad PCC.
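For context, PCC here is the Pearson correlation coefficient between the reference model's output and the device output. The test harness has its own comparison helper; the following is only a minimal pure-Python sketch of the metric itself, with hypothetical names:

```python
import math

def pcc(golden, actual):
    """Pearson correlation coefficient between two flattened output sequences.

    Returns 1.0 for perfectly correlated outputs; values well below the
    test threshold indicate the kind of degradation described above.
    """
    n = len(golden)
    mean_g = sum(golden) / n
    mean_a = sum(actual) / n
    cov = sum((g - mean_g) * (a - mean_a) for g, a in zip(golden, actual))
    var_g = sum((g - mean_g) ** 2 for g in golden)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_g * var_a)

# Identical outputs correlate perfectly; anti-correlated outputs give -1.
print(pcc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # → 1.0
```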
At the moment we can safely confirm that the issue is not tied to the cached weights, as we've re-generated both the general and the instruct weights and confirmed passing on older commits.
To successfully close this issue we will also add a token comparison to our demo code: after the Mixtral demo finishes, the outputs of its 32 users will be compared against expected outputs.
This will serve as an interim accuracy check until we add a proper perplexity score or top-1/top-5 accuracy check.
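The proposed token comparison could be sketched as follows; the function and variable names (`compare_demo_outputs`, the 32-user batch layout) are assumptions for illustration, not the actual demo API:

```python
def compare_demo_outputs(generated_outputs, expected_outputs):
    """Compare each user's generated token sequence to its expected sequence.

    Both arguments are lists (one entry per user, e.g. 32 entries) of token
    sequences. Returns the list of user indices whose output mismatched.
    """
    assert len(generated_outputs) == len(expected_outputs)
    failures = []
    for user_id, (got, expected) in enumerate(zip(generated_outputs, expected_outputs)):
        if got != expected:
            failures.append(user_id)
    return failures

# Example: 32 users where the last one produced garbage output.
generated = [["Hello", "world"]] * 31 + [["garbage"]]
expected = [["Hello", "world"]] * 32
print(compare_demo_outputs(generated, expected))  # → [31]
```

A demo-level check like this catches end-to-end regressions cheaply, at the cost of being brittle to benign sampling changes, hence the note that it is interim until a perplexity or top-1/top-5 check lands.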