xihuai18 opened this issue 6 months ago (status: Open)
HumanEval: Lines 226-229 belong to the function test_agent_concurrency, which only combines the completion list with the test list, so I think the canonical solution does not leak in this function.
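Concretely, that check is essentially the standard HumanEval-style evaluation, roughly like this (a simplified sketch, not the exact repository code):

```python
def check_completion(problem: dict, completion: str) -> bool:
    # Simplified sketch: the dataset-provided tests are only concatenated
    # with the *generated* completion at evaluation time, so the canonical
    # solution never appears in the generation prompt.
    program = (
        problem["prompt"]        # function signature + docstring
        + completion             # model-generated body
        + "\n"
        + problem["test"]        # dataset-provided check() function
        + f"\ncheck({problem['entry_point']})\n"
    )
    try:
        exec(program, {})        # raises AssertionError if a test fails
        return True
    except Exception:
        return False
```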
MBPP:
As shown in https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/programmer_mbpp.py#L38 and in the Google MBPP prompt recommendation (https://github.com/google-research/google-research/tree/master/mbpp): `You are an expert Python programmer, and here is your task: {prompt} Your code should pass these tests:\n\n{tests}\n[BEGIN]\n{code}\n[DONE]`
You can also see Fig. 1 in https://arxiv.org/pdf/2108.07732. MBPP's dataset-provided tests are meant to be visible to the LLM.
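For concreteness, filling in that template looks roughly like this (a simplified sketch using the MBPP dataset's `text` and `test_list` fields, not the exact repository code):

```python
def build_mbpp_prompt(task: dict) -> str:
    # Google-style MBPP few-shot prompt: the dataset-provided asserts are
    # deliberately included in the prompt in this format.
    tests = "\n".join(task["test_list"])
    return (
        "You are an expert Python programmer, and here is your task: "
        f"{task['text']} "
        f"Your code should pass these tests:\n\n{tests}\n[BEGIN]\n"
    )
```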
By the way, we have updated the AgentCoder implementation; you can pull it to get more readable source code.
That's true, but I don't think the author handled it properly. Only the function name needs to appear in the prompt (e.g., for the task "Write a function to check if the given number is woodball or not.", it is enough to tell the model that the generated content should begin with `def is_woodall(x)`), or to set an "entry_point" the way HumanEval does.
The prompts in https://github.com/noahshinn/reflexion/blob/main/programming_runs/benchmarks/mbpp-py.jsonl use the approach I've described.
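For example, one can expose only the entry point extracted from the first dataset test, without showing the expected outputs. A rough sketch (the helper and regex are illustrative, not code from either repository):

```python
import re

def build_leak_free_prompt(task: dict) -> str:
    # e.g. "assert is_woodall(383) == True" -> "is_woodall": reveal the
    # function name but never the expected input/output pairs.
    match = re.search(r"assert\s+(\w+)\s*\(", task["test_list"][0])
    entry_point = match.group(1) if match else "solution"
    return (
        f"{task['text']}\n"
        f"Your code should define a function named `{entry_point}`.\n"
        "[BEGIN]\n"
    )
```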
Dear Qlalq,
Thanks for your response and the URL reference. I have checked the MBPP-Py file; it seems very similar to HumanEval.
I am not sure what you mean by "but I don't think the author handled it properly". Do you mean that AgentCoder should use the MBPP-Py file to evaluate its effectiveness on MBPP (since we currently follow the DeepMind script), or that the DeepMind script can be switched to the HumanEval prompt template together with the MBPP-Py file?
I think both are fine; the point is not to leak the test set, and giving the function name is sufficient for that.
Oh, thanks for your response. I understand now.
Hi! I also found a label-leakage problem in programmer_mbpp.py, in the code-generation step: the prompt in "./prompts/mbpp_prompt_update.txt" asks GPT to generate code from the requirement description together with the test cases. Sometimes GPT echoes the test cases in the generated code, and when that code is passed to the test_designer, the test cases leak. The first figure shows the prompt used for code generation.
The second figure shows GPT's response; at the end of the red box you can see a test case appended to the generated code. Going through the whole codebase, I did not find anything that prevents this.
It is easy to tackle; this is just a reminder that providing the test cases directly causes test-case leakage, which makes the comparison with other methods unfair. :)
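For example, a small post-processing step like the one below (just an illustrative sketch, not code from the repository) would keep echoed asserts out of the test designer's context:

```python
def strip_appended_asserts(generated_code: str) -> str:
    # Drop any assert lines the model appended to its solution so that
    # dataset tests echoed in the generation cannot leak further.
    kept = [
        line for line in generated_code.splitlines()
        if not line.strip().startswith("assert ")
    ]
    return "\n".join(kept)
```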
Hi, you can check the discussion above about the MBPP prompt format; it includes the dataset-provided tests (not the EvalPlus ones) in the code-generation prompt:
The Google MBPP prompt recommendation (https://github.com/google-research/google-research/tree/master/mbpp): `You are an expert Python programmer, and here is your task: {prompt} Your code should pass these tests:\n\n{tests}\n[BEGIN]\n{code}\n[DONE]` You can also see Fig. 1 in https://arxiv.org/pdf/2108.07732. MBPP's tests are available for LLMs.
I think that is a good way to handle it: it is easy to tackle, and it is a fair reminder that providing the test cases directly would cause test-case leakage.
I would recommend opening a PR to update it.
Thanks for your reply. I understand now that this problem comes from the dataset format, because code generation and code testing rely on the same test cases.
In addition, I also have a question about the test executor. In test_executor_mbpp.py there is no step that uses the generated test samples to validate the code snippets, which differs from the procedure described in the paper. Is this the part we should modify, e.g. by changing the test_agent function to run the generated test samples?
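Something like the following is roughly what I mean (hypothetical names; `generated_tests` would come from the test designer rather than the dataset):

```python
def run_generated_tests(code: str, generated_tests: list[str]) -> bool:
    # Validate a candidate snippet against the test designer's tests
    # instead of the dataset-provided ones.
    namespace = {}
    try:
        exec(code, namespace)           # load the candidate solution
        for test in generated_tests:    # each test is an assert statement
            exec(test, namespace)
        return True
    except Exception:
        return False
```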
Thanks. :)
Hi, I have checked the functions in the current version; it indeed forgot to use the tests generated by the test designer (probably an oversight left over from my earlier version).
My original intention was to compare the coverage and bug-detection ability of the MBPP tests generated by the test designer against the dataset-provided public test cases. I will try to revise the current code and add more discussion of this part of MBPP to the paper (but I am currently busy with another project).
Understood! Thanks for replying. :)
It seems that the code computes its statistics by running the tests provided in the dataset; wouldn't that cause data/label leakage?
Solution leaking:
https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/test_executor_humaneval.py#L226-L229
https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/programmer_humaneval.py#L33

Test-result leaking:
https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/test_executor_mbpp.py#L144-L149
https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/programmer_mbpp.py#L30