I was able to use this implementation to benchmark on MMLU, but the scores were significantly lower than reported: I got 63.9%, which is higher than Mistral 7B but lower than the published Mixtral score of 70.9%. Do you have any thoughts or suggestions about the evaluation script I used (see eval_mixtral)?
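For reference, the score I report is plain multiple-choice accuracy over the answer letters. A minimal generic sketch of that computation (hypothetical `preds`/`golds` lists for illustration, not the actual script, which is linked below):

```python
# Hypothetical example data, not real MMLU results.
preds = ["A", "C", "B", "D"]  # answer letters chosen by the model
golds = ["A", "C", "D", "D"]  # reference answer letters

# Accuracy = fraction of questions where the predicted letter matches the gold letter.
correct = sum(p == g for p, g in zip(preds, golds))
accuracy = 100.0 * correct / len(golds)
print(f"MMLU accuracy: {accuracy:.1f}%")  # → MMLU accuracy: 75.0%
```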
https://github.com/bdytx5/mixtral8x7b_MMLU/blob/main/test/MixtralKit/tools/eval_mmlu.py
Thanks, Brett