nlpxucan / WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath

Your report of Llama-2-7b's performance on GSM and MATH #166

Closed tangzhy closed 11 months ago

tangzhy commented 11 months ago

Hi! Thanks for the amazing WizardMath work.

As shown in the README of WizardMath:

| Model | GSM8k Pass@1 | MATH Pass@1 |
|---|---|---|
| MPT-7B | 6.8 | 3.0 |
| Falcon-7B | 6.8 | 2.3 |
| LLaMA-1-7B | 11.0 | 2.9 |
| LLaMA-2-7B | 14.6 | 2.5 |

For Llama-2-7b, you reported that GSM=14.6 and MATH=2.5.

However, when I try to reproduce the results using your inference scripts:

python inference/gsm8k_inference.py --data_file data/gsm8k_test.jsonl --model "path_to_llama2_7b" --batch_size 60 --tensor_parallel_size 4

and

python inference/MATH_inference.py --data_file data/MATH_test.jsonl --model "path_to_llama2_7b" --batch_size 50 --tensor_parallel_size 4

I get extremely low performance for Llama-2-7b: GSM8K = 3 and MATH = 0.
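For reference, pass@1 on GSM8K is usually graded by extracting the final number from each greedy completion and comparing it to the gold answer. The sketch below is a minimal, hypothetical illustration of that heuristic (the names `extract_final_answer` and `gsm8k_pass_at_1` are not from the repo's scripts):

```python
import re

def extract_final_answer(completion: str):
    """Return the last number in a completion, a common heuristic
    for grading GSM8K outputs (hypothetical helper, not from the repo)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

def gsm8k_pass_at_1(completions, gold_answers):
    """Fraction of problems whose extracted answer equals the gold answer."""
    correct = sum(
        extract_final_answer(c) == g for c, g in zip(completions, gold_answers)
    )
    return correct / len(gold_answers)
```

With a heuristic like this, a base (non-instruction-tuned) model that never states a clean final number can score near zero even when its intermediate reasoning is partly right.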

Were the Llama-2-7b numbers you reported also produced with the inference scripts you provided, or were they copied from another source?

Can you explain it a bit more?

SeungyounShin commented 11 months ago

I also got the same problem.

flyinghpluo commented 11 months ago

Thank you for your attention to our work. The MPT, Falcon, LLaMA-1, and LLaMA-2 scores are taken from the LLaMA 2 paper (https://arxiv.org/abs/2307.09288). The inference scripts are only used to evaluate our WizardMath models.