reasoning-machines / pal

PaL: Program-Aided Language Models (ICML 2023)
https://reasonwithpal.com
Apache License 2.0

Performance on GSM8k got 69.9 #8

Closed: allanj closed this issue 1 year ago

allanj commented 1 year ago

I simply ran the script, and the accuracy I got is 69.9, but the one reported in the paper is 72.

I'm wondering what the potential difference could be. Note that the temperature is 0, so I guess there shouldn't be a large degree of randomness?

madaan commented 1 year ago

Thanks for your interest in our work! Which script did you run (if you could share the exact usage/parameters, that'd be useful)? Is it possible for you to share the output file? We've observed some randomness even with temperature = 0, perhaps due to API calls being routed to different models in the backend.
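
For anyone who wants to sanity-check this themselves, here is a minimal sketch of such a determinism check, assuming the legacy (pre-1.0) `openai` Python client that code-davinci-002 was served through; the prompt and API key are placeholders, and this is just an illustration, not code from this repo:

```python
import openai  # legacy (<1.0) SDK assumed here

openai.api_key = "YOUR_KEY"  # placeholder

PROMPT = "Q: <a GSM8K question>\n\n# solution in Python:\n"  # placeholder

def complete(prompt: str) -> str:
    # Greedy decoding: temperature 0 should in principle be deterministic.
    resp = openai.Completion.create(
        model="code-davinci-002",
        prompt=prompt,
        max_tokens=600,
        temperature=0.0,
    )
    return resp["choices"][0]["text"]

# Two identical requests; any difference here comes from the backend,
# not from sampling randomness.
a, b = complete(PROMPT), complete(PROMPT)
print("identical outputs:", a == b)
```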

allanj commented 1 year ago

gsm8k_program_aided_result.json.zip

I slightly modified the interface to output the code_snippet.

Sure. I simply tried this one: https://github.com/reasoning-machines/pal/blob/main/scripts/gsm_eval.py

allanj commented 1 year ago

That's an interesting observation. Do you know where I can find more documentation about their backend?

zhaoyiran924 commented 1 year ago

> I simply ran the script, and the accuracy I got is 69.9, but the one reported in the paper is 72.
>
> I'm wondering what the potential difference could be. Note that the temperature is 0, so I guess there shouldn't be a large degree of randomness?

Same here. I also ran the math script several times and got, at best, 70.3% on GSM (containing 1319 datapoints) and 68.86% on GSM8K (containing 7473 datapoints). May I ask whether you used majority voting or a temperature larger than 0 when you got 72% on the GSM dataset? Thanks so much for your help.

madaan commented 1 year ago

Thanks both!

I reproduced the numbers today with prompt-lib. Here are the commands:

Running the job

nohup python3 -u prompt-lib/prompt_lib/run_inference.py --task_id gsm_pal --num_prompt_examples -1 --name gsm_pal_code-davinci-002_s1 --model_name code-davinci-002 --max_tokens 600 --max_requests_per_min 16 --seed 0 --num_questions_per_thread 500 --temperature 0.0  --cot_task

Evaluation

python prompt-lib/prompt_lib/eval/eval.py --path "data/logs/gsm_pal/code-davinci-002/temp_0.0/seed_0/k_all/2022-12-08_06-29-38/outputs.jsonl" --type code

Output:

Accuracy: 72.02
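
For context on what the code-type eval does at a high level: PaL generates a Python program per question, executes it, and compares the result to the gold answer. Here is a rough sketch of that idea (the helper names are illustrative, not the actual prompt-lib code; it assumes the generated snippet defines a `solution()` function, as in the PaL math prompts):

```python
def execute_snippet(code: str):
    # Run the model-generated Python and pull out its answer.
    # (The real harness is more careful; exec on untrusted code is unsafe.)
    env = {}
    try:
        exec(code, env)
        return env["solution"]()
    except Exception:
        return None

def is_correct(predicted, gold: str, tol: float = 1e-3) -> bool:
    # GSM8K gold answers are plain numbers, so compare numerically.
    try:
        return abs(float(predicted) - float(gold)) < tol
    except (TypeError, ValueError):
        return False

# accuracy = 100.0 * sum(is_correct(execute_snippet(ex["code"]), ex["answer"])
#                        for ex in outputs) / len(outputs)
```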

Going forward, prompt-lib will be our standard method for doing bulk inference. It supports a number of other capabilities, including wandb logging and multiple decodings, so we definitely recommend checking it out.

We will investigate the reasons for the regression, but I know that Uri was able to get these cool results with the current scripts. Also, it is important to note that the numbers in the paper were reported after averaging over three random seeds, so the exact numbers for each run will not match exactly (but should be close).

@zhaoyiran924 We only report numbers for single decoding. The results with multiple samples are here and will be reported in the next version (TLDR: we get 80.4 @40 samples).
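
For anyone curious about what majority voting over multiple decodings looks like in the PaL setting, here is a minimal sketch (not the prompt-lib implementation; `execute_snippet` is the hypothetical helper from the eval sketch above): sample several programs at a temperature above 0, execute each, and keep the most common answer.

```python
from collections import Counter

def majority_vote(answers):
    # Drop failed executions, round to absorb float noise, take the mode.
    valid = [round(float(a), 6) for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]

# e.g., with 40 programs sampled at temperature > 0 and then executed:
# final_answer = majority_vote(execute_snippet(code) for code in sampled_programs)
```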

allanj commented 1 year ago

Thanks a lot for the experiments. Just a quick question before I jump into the new repo: what is actually the difference between the experiments we ran and the one you provided? (Except for the max token length.)

madaan commented 1 year ago

@allanj No difference ideally! They both use the same prompt, the same number of examples, and the same model. The regression might be due to an error in post-processing the results or to the eval script underestimating them. With that said, we will take another look and report on this issue if we find anything. Just a heads up: our updates might be slow due to EMNLP.

urialon commented 1 year ago

Hi all, I got 71.4 using this repo about a week ago. (I am a co-author of the paper, but I was not involved in the code until a week ago.)

@allanj and @zhaoyiran924 - did you notice any errors in the stdout logs? I noticed that the OpenAI API is sometimes temporarily unavailable and returns errors that make the results slightly worse. (We do have an exponential backoff retry mechanism, but if the API is unavailable for several minutes, we sometimes exhaust the overall number of retries and simply mark that example as incorrectly predicted.)
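
For reference, here is a generic sketch of the kind of exponential-backoff retry loop described here (not the repo's actual retry code; `call_with_backoff` and its parameters are illustrative):

```python
import random
import time

def call_with_backoff(request_fn, max_retries=5, base_delay=1.0):
    """Retry request_fn with exponential backoff plus jitter.

    Returns None if the API stays unavailable past max_retries; downstream
    scoring would then count that example as incorrectly predicted.
    """
    for attempt in range(max_retries):
        try:
            return request_fn()
        except Exception:
            # Sleep 1s, 2s, 4s, ... plus a little jitter, then retry.
            time.sleep(base_delay * (2 ** attempt) + random.random())
    return None
```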

allanj commented 1 year ago

Yeah, that could be the source of the randomness. Anyway, I think variance in the 70-72 range is reasonable, probably due to the OpenAI API.

shuyanzhou commented 1 year ago

Hi @allanj, @zhaoyiran924, I also re-ran the experiment on GSM (1319 examples) with --max_tokens 600. I got an accuracy of 71.26, similar to @urialon's result. And I agree that the small gaps might be due to randomness in the API calls.

allanj commented 1 year ago

Thanks all for your efforts. I will close the issue.