nlpxucan / WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath

Your report of Llama-2-7b's performance on GSM and MATH #166

Closed tangzhy closed 11 months ago

tangzhy commented 11 months ago

Hi! Thanks for the amazing WizardMath work.

As shown in the README of WizardMath:

| Model | GSM8k Pass@1 | MATH Pass@1 |
|---|---|---|
| MPT-7B | 6.8 | 3.0 |
| Falcon-7B | 6.8 | 2.3 |
| LLaMA-1-7B | 11.0 | 2.9 |
| LLaMA-2-7B | 14.6 | 2.5 |

For Llama-2-7b, you reported that GSM=14.6 and MATH=2.5.

However, when I try to reproduce the results using your inference scripts:

python inference/gsm8k_inference.py --data_file data/gsm8k_test.jsonl --model "path_to_llama2_7b" --batch_size 60 --tensor_parallel_size 4

and

python inference/MATH_inference.py --data_file data/MATH_test.jsonl --model "path_to_llama2_7b" --batch_size 50 --tensor_parallel_size 4

I get extremely low performance for Llama-2-7b: GSM8K = 3 and MATH = 0.
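For reference, pass@1 on GSM8K is usually graded by extracting the final number from each greedy completion and comparing it to the gold answer. The sketch below is a minimal, hypothetical illustration of that heuristic (the names `extract_final_answer` and `gsm8k_pass_at_1` are not from the repo's scripts):

```python
import re

def extract_final_answer(completion: str):
    """Return the last number in a completion, a common heuristic
    for grading GSM8K outputs (hypothetical helper, not from the repo)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

def gsm8k_pass_at_1(completions, gold_answers):
    """Fraction of problems whose extracted answer equals the gold answer."""
    correct = sum(
        extract_final_answer(c) == g for c, g in zip(completions, gold_answers)
    )
    return correct / len(gold_answers)
```

With a heuristic like this, a base (non-instruction-tuned) model that never states a clean final number can score near zero even when its intermediate reasoning is partly right.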

Were the Llama-2-7b numbers you reported also produced with the inference scripts you provided, or were they copied from another source?

Can you explain it a bit more?

SeungyounShin commented 11 months ago

I also got the same problem.

flyinghpluo commented 11 months ago

Thank you for your attention to our work. The MPT, Falcon, LLaMA-1, and LLaMA-2 scores are taken from the LLaMA 2 paper (https://arxiv.org/abs/2307.09288). The inference scripts are only used to evaluate our WizardMath models.