Hi folks,
Thanks for the paper and open-sourcing your evaluation code.
The Mixtral evaluation numbers looked a bit strange to me, so I checked which model was used for evaluation, since this was not reported in the paper (base, instruction-tuned, etc.). If I understand correctly based on this line, it seems this model was used for evaluation: https://huggingface.co/DiscoResearch/DiscoLM-mixtral-8x7b-v2.
Please note that this is a highly experimental model that was published before Mistral's official release, and which was also fine-tuned on some additional datasets. This is also indicated in the model card:

> This model is still an early Alpha with experimental code and we can't guarantee that all values are correct.
It would be great if you could reevaluate the Mixtral model with the official instruction-tuned version: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1. This model is also available on Together's API: https://docs.together.ai/docs/inference-models.
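For reference, below is a minimal sketch (not your evaluation harness) of how the official instruction-tuned checkpoint could be loaded and prompted with Hugging Face transformers. It assumes a recent transformers release with Mixtral support, accelerate installed, and enough GPU memory (or quantization) for the 8x7B weights:

```python
# Minimal sketch: load the official instruction-tuned Mixtral and run one prompt.
# Assumes a transformers version with Mixtral support, accelerate for device_map,
# and sufficient GPU memory (half precision needs ~90 GB; quantization can reduce this).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # shard across available GPUs
)

# The instruct model expects Mistral's chat template, which the tokenizer provides.
messages = [{"role": "user", "content": "Question: What is 2 + 2? Answer:"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Running the paper's existing prompts through this checkpoint (or the corresponding Mixtral-8x7B-Instruct endpoint on Together) should make the comparison with Gemini-Pro much more meaningful.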
That way, one should obtain evaluation scores in line with those reported here: https://github.com/open-compass/MixtralKit. For example, they report 67.1 on BigBench-Hard compared to 41.76 in your paper; note that this is higher than Gemini-Pro, which scores 65.58 according to your paper. They also report 65.7 compared to 58.45 in your paper, etc.
Thanks!
Kind regards,
Niels