Open phimachine opened 3 months ago
Hi,
To check the original version of AgentCoder, we need to go back to the previous version. The implementation of call_fix_bug was
originally used in https://github.com/huangd1999/AgentCoder/blob/cdfce4ab074a09b0433fa95d980d0fcf11c4ccbe/test_executor_humaneval.py#L325C23-L325C35
Hello Authors,
I am a researcher reproducing your paper.
As others have mentioned (#2), the code base released here cannot reproduce the paper's results. The function call_fix_bug, for example, is defined but never used, which means the test designer is never used in the current repository. Instead, you simply resample from the LLM with call_completion() after running the canonical tests at every resampling epoch. At every epoch, if the synthesized code passes all tests, the code is kept. This is the same behavior as in the defined-but-unused fix_bug() function.
If true, this would be a major issue with AgentCoder, as the code released here is inconsistent with the paper's claims: the ground truth is used for evaluation at every self-debug epoch, making the metric pass@(k+1), not pass@1, where k is the number of resampling epochs and the +1 accounts for the first code generation step. In the paper, you report all metrics as pass@1. This is in addition to the issue that simple resampling is not the method described in the paper.
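To make the pass@(k+1) point concrete, here is a minimal sketch of the argument. It is not the repository's code: the function names are hypothetical, and it assumes each LLM sample independently passes the canonical tests with probability p. Under that assumption, keeping the first sample that passes the ground-truth tests over 1 + k attempts succeeds with probability 1 - (1 - p)^(k+1), which is what pass@(k+1) measures, rather than p, which is what pass@1 measures.

```python
import random


def resample_with_oracle(sample_passes, k):
    """Hypothetical loop: one initial generation plus up to k resamples,
    each checked against the canonical (ground-truth) tests."""
    for _ in range(k + 1):
        if sample_passes():
            return True  # code that passes the canonical tests is kept
    return False


def closed_form(p, k):
    """Success probability of the loop above, assuming independent samples."""
    return 1.0 - (1.0 - p) ** (k + 1)


def simulate(p, k, trials=100_000, seed=0):
    """Monte Carlo estimate of the loop's success rate."""
    rng = random.Random(seed)
    hits = sum(
        resample_with_oracle(lambda: rng.random() < p, k)
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    p, k = 0.5, 4
    print("pass@1 (single sample):", p)
    print("closed form, 1 + k attempts:", closed_form(p, k))
    print("simulated, 1 + k attempts:", simulate(p, k))
```

With k = 0 the loop reduces to a single sample, so the reported number would be a true pass@1. Any k > 0 with ground-truth checking inflates it.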