Not reproducible? pass@k was used, not pass@1

Hello Authors,

I am a researcher reproducing your paper.

As others have mentioned (#2), the code base released here cannot reproduce the paper's results. For example, the function call_fix_bug is defined but never called, which means the test designer is never used in the current repository. Instead, the code simply resamples from the LLM with call_completion() after running the canonical tests at every resampling epoch, and keeps the synthesized code at any epoch where it passes all tests. The defined-but-unused fix_bug() function has the same behavior.
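To make the control flow concrete, here is a minimal sketch of the loop as I read it. The names `resample_until_pass`, `generate`, and `passes_canonical_tests` are hypothetical stand-ins of mine, not identifiers from the repository, and the real code may differ in detail.

```python
from typing import Callable, Optional


def resample_until_pass(
    generate: Callable[[], str],
    passes_canonical_tests: Callable[[str], bool],
    epochs: int,
) -> Optional[str]:
    """Draw up to epochs + 1 samples and keep the first one that passes.

    `generate` stands in for a plain LLM resample, and
    `passes_canonical_tests` stands in for running the ground-truth tests.
    Both are hypothetical placeholders.
    """
    for _ in range(epochs + 1):  # initial generation plus `epochs` resamples
        code = generate()
        if passes_canonical_tests(code):
            return code  # kept as soon as the canonical tests pass
    return None  # no sample passed within the budget
```

Note that nothing in this loop feeds test feedback back into generation; the canonical tests only act as a filter over independent samples.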

If true, this would be a major issue with AgentCoder, because the released code is inconsistent with the paper's claims. The ground truth is used for evaluation at every self-debug epoch, which makes the metric pass@(k+1) rather than pass@1, where k is the number of resampling epochs and the +1 accounts for the first code generation step. The paper reports all metrics as pass@1. This is in addition to the issue that simple resampling is not the method described in the paper.
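To illustrate how large the gap can be, here is a rough sketch under a simplifying assumption of mine: every sample passes the canonical tests independently with the same probability p. The function names and the example numbers are hypothetical and are not taken from the paper or the repository.

```python
def pass_at_1(p: float) -> float:
    """Probability that a single sample passes the canonical tests."""
    return p


def pass_with_resampling(p: float, k: int) -> float:
    """Probability that at least one of k + 1 independent samples passes.

    This is what the loop measures if the canonical tests are used as a
    filter at every epoch (assumption: independent, identically
    distributed samples).
    """
    return 1.0 - (1.0 - p) ** (k + 1)


if __name__ == "__main__":
    p, k = 0.5, 4  # illustrative values only
    print(pass_at_1(p))                # 0.5
    print(pass_with_resampling(p, k))  # 0.96875
```

Under these illustrative values, a model with a true pass@1 of 0.5 would appear to solve about 97% of problems, which is why the two metrics should not be reported interchangeably.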
