Closed: tangzhy closed this issue 11 months ago
I also got the same problem.
Thank you for your attention to our work. The MPT, Falcon, Llama-1, and Llama-2 scores are retrieved from the LLaMA 2 paper (https://arxiv.org/abs/2307.09288). The inference scripts are used to evaluate our WizardMath models.
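This mismatch is expected: evaluation scripts written for an instruction-tuned model typically extract the final answer from a fixed phrase that the tuned model was trained to emit, while a base model given the same prompt produces free-form continuations that the extractor cannot parse, so nearly every answer is scored as wrong. As a minimal sketch (the `extract_answer` function and its regex are hypothetical, not the actual code from this repository):

```python
import re

def extract_answer(completion: str):
    """Hypothetical answer extractor in the style of instruction-tuned
    math evals: expects the model to finish with 'The answer is: X'."""
    match = re.search(r"The answer is:?\s*\$?(-?[\d,.]+)", completion)
    if match:
        # Strip thousands separators so "1,000" compares equal to "1000".
        return match.group(1).replace(",", "")
    return None  # no recognizable final answer -> counted as incorrect

# An instruction-tuned model (e.g. WizardMath) tends to end its output like:
tuned_output = "Step 1: ... Step 2: ... The answer is: 42"
# A base model often just continues the prompt text instead:
base_output = "Question: If John has 6 apples and buys 3 more, then..."

print(extract_answer(tuned_output))  # extracts "42"
print(extract_answer(base_output))   # None -> scored as incorrect
```

By contrast, the LLaMA 2 paper's reported numbers come from its own few-shot evaluation setup, not from these scripts, which is why the two sets of results are not directly comparable.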
Hi! Thanks for the amazing WizardMath work.
As shown in the README of WizardMath:
For Llama-2-7b, you reported that GSM=14.6 and MATH=2.5.
However, when I try to run the results using your inference scripts:
python inference/gsm8k_inference.py --data_file data/gsm8k_test.jsonl --model "path_to_llama2_7b" --batch_size 60 --tensor_parallel_size 4
and
python inference/MATH_inference.py --data_file data/MATH_test.jsonl --model "path_to_llama2_7b" --batch_size 50 --tensor_parallel_size 4
I get extremely low performance for Llama-2-7b: GSM=3 and MATH=0.
I wonder whether the reported Llama-2-7b results were also produced with the inference scripts you provided, or whether they were copied from elsewhere?
Can you explain it a bit more?