huangd1999 / AgentCoder

This repo is the official implementation of AgentCoder and AgentCoder+.

Label Leaking? #3

Open xihuai18 opened 6 months ago

xihuai18 commented 6 months ago

It seems that the code uses statistics from running the tests in the dataset. Wouldn't that cause data / label leakage?

solutions leaking https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/test_executor_humaneval.py#L226-L229

https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/programmer_humaneval.py#L33

test result leaking https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/test_executor_mbpp.py#L144-L149

https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/programmer_mbpp.py#L30

huangd1999 commented 6 months ago

HumanEval: Lines 226-229 are inside the function test_agent_concurrency, which only operates on the completion lists + test lists. So I think the canonical solution does not leak in this function.

MBPP: As shown in https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/programmer_mbpp.py#L38 and in the Google MBPP prompt recommendation (https://github.com/google-research/google-research/tree/master/mbpp): `You are an expert Python programmer, and here is your task: {prompt} Your code should pass these tests:\n\n{tests}\n[BEGIN]\n{code}\n[DONE]`. You can also see Fig. 1 in https://arxiv.org/pdf/2108.07732. MBPP's tests are available to LLMs.
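For clarity, here is a minimal sketch of how that prompt template can be filled in from an MBPP task (the field names `text` and `test_list` follow the public MBPP dataset; AgentCoder's own scripts may access the task differently):

```python
# Sketch only: assembling the Google-recommended MBPP prompt.
# The field names `text` and `test_list` follow the public MBPP dataset;
# AgentCoder's own scripts may read the task differently.
MBPP_TEMPLATE = (
    "You are an expert Python programmer, and here is your task: {prompt} "
    "Your code should pass these tests:\n\n{tests}\n[BEGIN]\n"
)

def build_mbpp_prompt(task: dict) -> str:
    tests = "\n".join(task["test_list"])  # dataset-provided asserts are shown to the model
    return MBPP_TEMPLATE.format(prompt=task["text"], tests=tests)
```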

By the way, we have updated the implementation of AgentCoder; you can pull it to get more readable source code.

Qlalq commented 5 months ago

That's true, but I don't think the author handled it properly: only the function name needs to appear in the prompt (e.g., for "Write a function to check if the given number is woodball or not.", the generated content would begin with `def is_woodall(x)`), or set an `entry_point` as HumanEval does.
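For illustration only (a hypothetical helper, not part of AgentCoder), extracting just the entry-point name from the first dataset assert would let the prompt reveal the expected function signature without leaking the tests themselves:

```python
import re
from typing import Optional

def entry_point_from_test(assert_line: str) -> Optional[str]:
    """Pull the called function name out of a dataset assert such as
    'assert is_woodall(383) == True', without exposing the full test."""
    match = re.search(r"assert\s+([A-Za-z_]\w*)\s*\(", assert_line)
    return match.group(1) if match else None

# entry_point_from_test("assert is_woodall(383) == True") -> "is_woodall"
```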

Qlalq commented 5 months ago

https://github.com/noahshinn/reflexion/blob/main/programming_runs/benchmarks/mbpp-py.jsonl The prompts here use the approach I've described

huangd1999 commented 5 months ago

Dear Qlalq,

Thanks for your response and the URL reference. I have checked the MBPP-Py file; it seems very similar to HumanEval.

I am not sure what you mean by "but I don't think the author handled it properly". Do you mean that AgentCoder should use the MBPP-Py file to evaluate effectiveness on MBPP (since we currently follow the DeepMind script), or that the DeepMind script could be switched to the HumanEval prompt template together with the MBPP-Py file?

Qlalq commented 5 months ago

I think both are fine, since the point is not to leak the test set, and giving the function name is sufficient.

huangd1999 commented 5 months ago

Oh, thanks for your response. I understand now.

DorothyDUUU commented 3 months ago

Hi! I also found a label leaking problem in programmer_mbpp.py, in the code generation step: the prompt in "./prompts/mbpp_prompt_update.txt" asks GPT to generate code from the requirement description together with the test cases. Sometimes GPT responds with the test cases included in the generated code, and when this generated code is passed to the test_designer, it causes test case leakage. This figure shows the prompt used for code generation: [screenshot of the code-generation prompt]

This figure shows GPT's response; at the end of the red box you can see that the test cases are appended to the generated code. Going through the files, I found no way to avoid this.

[screenshot of the GPT response]

It is easy to tackle; this is just a reminder that providing the test cases directly causes test case leakage, which makes comparisons with other methods unfair. :)
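One possible mitigation, sketched here only as an illustration (not the repository's implementation): strip any top-level assert lines that the model copied from the prompt before handing the completion to the test designer.

```python
def strip_copied_asserts(completion: str) -> str:
    """Drop top-level (unindented) assert lines that the model may have
    copied from the prompt's test cases into its generated code."""
    kept = [line for line in completion.splitlines()
            if not line.startswith("assert ")]
    return "\n".join(kept)
```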

huangd1999 commented 3 months ago

Hi, you can check the discussion above about MBPP's prompt format; it includes the dataset-provided tests (not the EvalPlus ones) during code generation:

The Google MBPP prompt recommendation (https://github.com/google-research/google-research/tree/master/mbpp): `You are an expert Python programmer, and here is your task: {prompt} Your code should pass these tests:\n\n{tests}\n[BEGIN]\n{code}\n[DONE]`. You can also see Fig. 1 in https://arxiv.org/pdf/2108.07732. MBPP's tests are available to LLMs.

huangd1999 commented 3 months ago

I think that is a good way to handle it:

> It is easy to tackle, just a reminder that providing test cases directly would cause test case leakage.

I recommend opening a PR to update it.

DorothyDUUU commented 3 months ago

Thanks for your reply. I understand now that this problem comes from the dataset, because code generation and testing are based on the same test cases.

In addition, I have a question about the test executor. In test_executor_mbpp.py there is no procedure that uses the generated test samples to validate the code snippets, which differs from the procedure described in the paper. Is this something we should modify, e.g. change the test_agent function to run the generated test samples?
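For reference, a minimal sketch of what validating a completion against the designer-generated tests could look like (the names `completion` and `generated_tests` are illustrative, not the paper's exact procedure; a real executor would add sandboxing and timeouts):

```python
def passes_generated_tests(completion: str, generated_tests: list) -> bool:
    """Execute the candidate code, then run each designer-generated assert
    against it in the same namespace; any exception counts as a failure."""
    namespace = {}
    try:
        exec(completion, namespace)   # define the candidate function(s)
        for test in generated_tests:  # each test is an assert statement string
            exec(test, namespace)
    except Exception:
        return False
    return True
```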

Thanks. :)

huangd1999 commented 3 months ago

Hi, I have checked the functions in the current version, and it seems that (perhaps due to an oversight when rewriting my previous version) I forgot to use the tests generated by test_designer.

My earlier intention was to evaluate the coverage and bug-detection ability of the MBPP tests generated by the test designer against the dataset-provided public test cases. I will try to revise the current code and add more discussion of this part of MBPP to the paper (but I am currently busy with another project).

DorothyDUUU commented 3 months ago

Understood! Thanks for replying. :)