Closed QiushiSun closed 1 year ago
Hi @QiushiSun , Thank you for your interest in our work and for your kind words!
I checked, and the best results reported in the paper for `text-davinci-003` on GSM8K were obtained with `--max_tokens=512`.
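For anyone reproducing this, a minimal sketch of how the flag plugs into the script's argument parsing. Only `--model` and `--max_tokens=512` are confirmed in this thread; the rest is a hypothetical illustration, not the actual contents of `gsm_eval.py`:

```python
import argparse

# Hypothetical sketch of the relevant flags in gsm_eval.py.
parser = argparse.ArgumentParser()
parser.add_argument('--model', default='text-davinci-003', type=str)
# 512 is the value used for the paper's GSM8K numbers (per this thread).
parser.add_argument('--max_tokens', default=512, type=int)

args = parser.parse_args([])  # empty list -> use the defaults
print(args.model, args.max_tokens)
```

Passing `--max_tokens=512` on the command line (or setting it as the default, as above) should match the paper's setup.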
Sorry for the confusion, I'll update it in the script.
(There might still be minor inconsistencies arising from the nondeterministic outputs of OpenAI's models, but these usually lead to only a ~2-3 point difference.)
(Any additional thoughts @madaan ?)
Please let us know if you manage to reproduce the results now and if you have any other questions.
Also, please note that querying `text-davinci-003` is 10 times more expensive than `gpt-3.5-turbo` (ChatGPT), which our code supports as well.
Best, Uri
Got it! Thank you for your prompt reply and the reminders 👍
Thanks for releasing this amazing repo. I have a minor concern about the performance of PAL reported in the paper for natural-language LMs, specifically `text-davinci-003`. The paper reports an accuracy of 69.8 on the GSM8K dataset (Appendix C), but my reproduction yields an accuracy of 0.5936315390447309 (~59.4).
I reproduced the results by directly modifying the parameters in `gsm_eval.py`:
parser.add_argument('--model', default='text-davinci-003', type=str)
Are there any additional settings needed?
Deep thanks!