huangd1999 / AgentCoder

This Repo is the official implementation of AgentCoder and AgentCoder+.

Not reproducible? pass@k was used, not pass@1 #8

Open phimachine opened 3 weeks ago

phimachine commented 3 weeks ago

Hello Authors,

I am a researcher reproducing your paper.

As others have mentioned (#2), the code released here cannot reproduce the paper's results. For example, the function call_fix_bug is defined but never called, which means the test designer is never used in the current repository. Instead, the code simply resamples from the LLM with call_completion() after running the canonical tests at every resampling epoch. At each epoch, the synthesized code is kept if it passes all tests. The defined-but-unused fix_bug() function has the same behavior.

If true, this would be a major issue with AgentCoder, because the released code is inconsistent with the paper's claims. The ground truth is used for evaluation at every self-debug epoch, which makes the metric pass@(k+1) rather than pass@1, where k is the number of resampling epochs and the +1 accounts for the first code generation step. The paper reports all metrics as pass@1. This is in addition to the issue that simple resampling is not the method described in the paper.
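To make the metric concern concrete, here is a minimal sketch of the effect. It is not code from the repository. It assumes, purely for illustration, that each sampled program passes the canonical tests independently with a fixed probability p; the names passes_ground_truth, solved_with_resampling, at_least_one_of, and simulate are hypothetical.

```python
import random


def passes_ground_truth(rng, p):
    """Hypothetical stand-in: one sampled program passes the canonical tests with probability p."""
    return rng.random() < p


def solved_with_resampling(rng, p, k):
    """One initial generation plus up to k resampling epochs,
    each checked against the canonical (ground-truth) tests."""
    for _ in range(k + 1):
        if passes_ground_truth(rng, p):
            return True  # the first passing sample is kept
    return False


def at_least_one_of(p, n):
    """Probability that at least one of n independent samples passes."""
    return 1.0 - (1.0 - p) ** n


def simulate(p, k, trials=20000, seed=0):
    """Fraction of problems counted as solved under the resampling loop."""
    rng = random.Random(seed)
    solved = sum(solved_with_resampling(rng, p, k) for _ in range(trials))
    return solved / trials


if __name__ == "__main__":
    p, k = 0.5, 4
    print("single-sample pass rate:", p)
    print("simulated rate with resampling:", simulate(p, k))
    print("at least one of k+1 samples:", at_least_one_of(p, k + 1))
```

Under that independence assumption, the rate reported by the resampling loop tracks the chance that at least one of k+1 samples passes, not the single-sample pass rate, which is why reporting it as pass@1 would overstate the result.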

huangd1999 commented 3 weeks ago

Hi,

To check the original version of AgentCoder, we need to go back to a previous version. call_fix_bug was originally used in https://github.com/huangd1999/AgentCoder/blob/cdfce4ab074a09b0433fa95d980d0fcf11c4ccbe/test_executor_humaneval.py#L325C23-L325C35