Closed: YcChou closed this issue 1 month ago
Because we were using a fine-tuned llama2-7b for evaluation. We have since changed the code base, and a new version of the paper will be uploaded soon. We now provide llama3-8b and llama2-13b evaluations on the GSM8K, AQuA, and BBH datasets.
Thanks for your great work!
When I evaluate the zero-shot capability of the llama2-7b-chat model on GSM8K using your code, I cannot reproduce the fp16 baseline result from your paper: 7.7 (reproduced) vs. 19.8 (paper). May I ask what parameters your bash script uses? Mine are as follows:
```bash
python evaluation_gsm8k.py \
    --model 'llama2-7b-chat' \
    --batch_size 6 \
    --max_new_tokens 256 \
    --zero_shot
```
Thanks!