Paper reproduce problem

duce3790 commented 6 months ago

https://github.com/huangd1999/AgentCoder/blob/650a44d670059e060506d80340cce12fcaf67d0d/programmer_mbpp.py#L104C1-L105C1

I have a few questions regarding the paper you shared. It introduces a three-agent framework, which includes the test executor agent. From what I gathered, the test executor agent executes tests, and the results are supposed to be sent back to the programmer for bug fixing. However, I couldn't find this process explicitly mentioned in the program.

To replicate the results outlined in the paper, I assume I should utilize the call_fix_bug() function in line 203 of test_executor_mbpp.py, followed by another call to call_fix_bug() in programmer_mbpp.py. However, in line 104 of programmer_mbpp.py, the code calls fetch_completion() instead of fix_bug(). Following this logic, it seems that the accuracy improvement is solely attributed to generating more code iterations rather than incorporating feedback for bug fixing.

huangd1999 commented 6 months ago

Hi duce3790,

Thanks for raising this point about the feedback process in AgentCoder. We are still looking for a more effective feedback strategy for buggy code, which is why the repository provides several different functions for refining buggy code and several ways of calling them. In our recent experiment, calling fetch_completion directly obtained 80.5% pass@1, while call_fix_bug() obtained 79.9% pass@1.

> Following this logic, it seems that the accuracy improvement is solely attributed to generating more code iterations rather than incorporating feedback for bug fixing.

We attribute the 80.5% obtained with fetch_completion() to two factors. First, the test cases provided by the test designer detect most of the buggy code among the LLM-generated code: for example, pass@1 on the test-designer-generated test cases increases from about 67% to 96% as the number of iteration steps increases, and the code that still fails is then refined by the programmer. Second, fetch_completion() differs from call_fix_bug(), which feeds the information reported by the test executor back to the programmer. We observe that this information introduces a bias that predominantly affects the programmer's code generation: for example, feedback that mixes correct and incorrect test cases can cause the programmer to generate incorrect code that passes the incorrect test cases.
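The bias from incorrect test cases can be illustrated with a toy sketch. Nothing here is taken from the AgentCoder code: the task, the function names, and the test lists are all hypothetical, and the "overfit" function only stands in for what a programmer agent might produce when told to satisfy every reported test.

```python
# Toy illustration only; none of these names come from the AgentCoder repository.

def is_even_correct(n: int) -> bool:
    """Behaviour the (hypothetical) task actually asks for."""
    return n % 2 == 0


def is_even_overfit(n: int) -> bool:
    """Code shaped by feedback that includes a wrong test case."""
    if n == 3:
        return True  # special case added only to satisfy the incorrect test
    return n % 2 == 0


# Tests as a test designer might emit them: (input, expected).
# The last entry is deliberately wrong, since 3 is not even.
generated_tests = [(2, True), (5, False), (3, True)]

# Ground-truth tests that the programmer never sees.
hidden_tests = [(3, False), (4, True)]


def passes_all(func, tests) -> bool:
    """True if func returns the expected value for every test case."""
    return all(func(value) == expected for value, expected in tests)


# The correct code is rejected by the generated tests, while the overfit
# code is accepted by them and then fails the hidden ground truth.
correct_on_generated = passes_all(is_even_correct, generated_tests)
overfit_on_generated = passes_all(is_even_overfit, generated_tests)
correct_on_hidden = passes_all(is_even_correct, hidden_tests)
overfit_on_hidden = passes_all(is_even_overfit, hidden_tests)
```

In this sketch, feeding the failing (and wrong) test back to the programmer is exactly what pushes it from the correct function toward the overfit one.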

Because we observe that this bias strongly affects the correctness of the regenerated code, in practice we prefer to call the fetch_completion function directly.
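The two strategies discussed in this thread can be contrasted with a minimal sketch. It is an assumption-laden illustration, not the repository's implementation: `generate` and `fix` are hypothetical stand-ins for fetch_completion and call_fix_bug(), their signatures are invented here, and candidates are plain Python callables rather than generated source code.

```python
from typing import Callable, List, Optional, Tuple

Candidate = Callable[[int], int]
TestCase = Tuple[int, int]  # (input, expected output)


def run_tests(candidate: Candidate, tests: List[TestCase]) -> List[TestCase]:
    """Return the test cases the candidate fails (an exception counts as a failure)."""
    failed = []
    for value, expected in tests:
        try:
            ok = candidate(value) == expected
        except Exception:
            ok = False
        if not ok:
            failed.append((value, expected))
    return failed


def refine(
    task: str,
    tests: List[TestCase],
    generate: Callable[[str], Candidate],
    fix: Optional[Callable[[str, Candidate, List[TestCase]], Candidate]] = None,
    max_iters: int = 5,
) -> Candidate:
    """Iterate until the candidate passes the tests or max_iters is reached.

    With fix=None, a failing candidate is regenerated from the task alone
    (hypothetical analogue of calling fetch_completion again). Otherwise the
    failing tests are passed back to fix (hypothetical analogue of call_fix_bug).
    """
    candidate = generate(task)
    for _ in range(max_iters):
        failed = run_tests(candidate, tests)
        if not failed:
            return candidate
        if fix is None:
            candidate = generate(task)
        else:
            candidate = fix(task, candidate, failed)
    return candidate  # may still fail some tests


# Usage with deterministic stand-ins for the LLM calls.
tests = [(2, 4), (3, 6)]
attempts = iter([lambda x: x + 1, lambda x: x * 2])
regenerated = refine("double the input", tests, generate=lambda task: next(attempts))

seen_feedback = []


def fix_with_feedback(task, candidate, failed):
    seen_feedback.append(failed)  # record what the "programmer" was shown
    return lambda x: x * 2


repaired = refine(
    "double the input",
    tests,
    generate=lambda task: (lambda x: x + 1),
    fix=fix_with_feedback,
)
```

In both modes the tests decide when to stop; the only difference is whether the failing test cases reach the programmer, which is where incorrect tests could bias the next attempt.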

> accuracy improvement is solely attributed to generating more code iterations

In our assessment, simply generating more code iterations may not improve pass@1; what matters most may be detecting incorrect code and then debugging it over multiple iterations.

I hope this discussion answers your questions about AgentCoder. We will add it to a future version of the paper (yes, it is still under development).

huanhuan6666 commented 2 months ago

While I appreciate the practical considerations that may have led to this implementation choice, it would be helpful to have more clarity on why the approach differs from the paper. Additionally, some important aspects of the method remain ambiguous in both the paper and the code repository.

To improve transparency and reproducibility, it would be beneficial to align the implementation more closely with the paper's described methods, or alternatively, update the paper to reflect the current practical approach used in the code. This would help readers and potential users better understand the true methodology being employed.

huangd1999 commented 2 months ago

Thanks for your suggestion. I will add another branch to address this problem.
