Closed QiushiSun closed 1 year ago
Hi @QiushiSun , Thank you for your interest in our work and for your kind words!
I checked, and the best results reported in the paper for `text-davinci-003` on GSM8K were obtained with `--max_tokens=512`.
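For anyone reproducing this, a minimal sketch of how the flag plugs into the script's argument parsing. Only `--model` and `--max_tokens=512` are confirmed in this thread; the rest is a hypothetical illustration, not the actual contents of `gsm_eval.py`:

```python
import argparse

# Hypothetical sketch of the relevant flags in gsm_eval.py.
parser = argparse.ArgumentParser()
parser.add_argument('--model', default='text-davinci-003', type=str)
# 512 is the value used for the paper's GSM8K numbers (per this thread).
parser.add_argument('--max_tokens', default=512, type=int)

args = parser.parse_args([])  # empty list -> use the defaults
print(args.model, args.max_tokens)
```

Passing `--max_tokens=512` on the command line (or setting it as the default, as above) should match the paper's setup.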
Sorry for the confusion, I'll update it in the script.
(There might still be minor inconsistencies arising from the nondeterministic outputs of OpenAI's models, but these usually lead to only a ~2-3 point difference.)
(Any additional thoughts @madaan ?)
Please let us know if you manage to reproduce the results now and if you have any other questions.
Also, please note that querying `text-davinci-003` is 10 times more expensive than `gpt-3.5-turbo` (ChatGPT), which our code supports as well.
Best, Uri
Got it! Thank you for your prompt reply and the reminders 👍
Thanks for releasing this amazing repo. I have a minor concern about the performance of PAL reported in the paper for natural-language LMs, specifically `text-davinci-003`. The paper reports an accuracy of 69.8 on the GSM8K dataset (Appendix C), but my reproduction yields an accuracy of 0.5936315390447309 (~59.4).
I reproduced the results by directly modifying the parameters in `gsm_eval.py`:
parser.add_argument('--model', default='text-davinci-003', type=str)
Are there any additional settings needed?
Deep thanks!