Closed — VijayLingam95 closed this 1 month ago
Hi, thanks for pointing this out.
This is indeed a bug we noticed and fixed for GPT-4 but did not fix for GPT-3.5. We will reevaluate the results for this model and update the paper and repository.
For this particular set of solutions, it was the same GPT-3.5 model that generated the synthetic tests; the code was later made more flexible on this point.
Dear Authors,
I attempted to reproduce the results on the programming task from your paper. However, I encountered a critical issue in the `programming/mcts.py` file: the `num_success` counter is incremented without verifying whether the generated solution passed the actual test. For your reference, the relevant code block is linked below:

https://github.com/lapisrocks/LanguageAgentTreeSearch/blob/43901ce2fd21e3e3dda115f440a8ab75acab3574/programming/mcts.py#L134-L144
Specifically, `num_success += 1` should be replaced with `num_success += int(is_passing)`. After running your code (with `max_iters=8` and `number_of_tests=4`) using the GPT-3.5-Turbo model, I noticed from the logs that 21 incorrect solutions were contributing to the accuracy metric. After fixing this bug, the accuracy on HumanEval dropped from 86.95% (terminal output: `completed 161/161: acc = 0.87`, run on the 161 HumanEval problems from Reflexion) to 73.91%.

Additionally, from the commit history, it appears that the GPT-4 model was used to generate the synthetic tests. Could you please confirm whether this is the case?
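To make the proposed fix concrete, here is a minimal sketch of the success-tallying logic. Only the counting line mirrors the linked file; the surrounding loop and the example `results` list are hypothetical simplifications for illustration:

```python
# Hypothetical, simplified reproduction of the tallying logic around
# programming/mcts.py; only the num_success counting line reflects the issue.
results = [True, False, True]  # example per-problem pass/fail outcomes

num_success = 0
for is_passing in results:
    # Buggy version counted every generated solution as a success:
    #   num_success += 1
    # Fixed version only counts solutions that actually pass the tests:
    num_success += int(is_passing)

accuracy = num_success / len(results)
```

With the fix, `accuracy` reflects only genuinely passing solutions rather than the total number of attempts.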
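As a sanity check, the corrected figure follows directly from removing the 21 false positives from the 161-problem run (a small arithmetic sketch using only the numbers reported above, not code from the repository):

```python
# Reported 86.95% accuracy on 161 problems implies ~140 counted successes;
# removing the 21 incorrectly counted solutions gives the corrected accuracy.
total_problems = 161
reported_accuracy = 0.8695          # accuracy with the buggy counter
false_positives = 21                # incorrect solutions counted as passing

counted_successes = round(reported_accuracy * total_problems)   # 140
true_successes = counted_successes - false_positives            # 119
corrected_accuracy = true_successes / total_problems
print(f"{corrected_accuracy:.2%}")  # 73.91%
```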