huangd1999 / AgentCoder

This Repo is the official implementation of AgentCoder and AgentCoder+.

Label Leaking? #3

Open xihuai18 opened 1 month ago

xihuai18 commented 1 month ago

It seems that the code uses the results of running the dataset's ground-truth tests during generation. Wouldn't that cause data/label leaking?

Solutions leaking: https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/test_executor_humaneval.py#L226-L229

https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/programmer_humaneval.py#L33

Test result leaking: https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/test_executor_mbpp.py#L144-L149

https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/programmer_mbpp.py#L30
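To make the concern concrete, here is a minimal sketch of the pattern being questioned (hypothetical code, not the repo's actual implementation; `pick_completion` and its arguments are made-up names):

```python
# Hypothetical sketch (not AgentCoder's code): if candidate completions
# are filtered or ranked by whether they pass the dataset's ground-truth
# tests, the benchmark labels influence which answer gets reported.

def pick_completion(candidates, ground_truth_test):
    """Return the first candidate that passes the dataset's own tests.

    Using `ground_truth_test` (the benchmark label) at selection time is
    the "test result leaking" concern: pass@1 is then measured on an
    output that was chosen with knowledge of the answer key.
    """
    for code in candidates:
        try:
            exec(code + "\n" + ground_truth_test, {})  # run label tests
            return code  # selected because the label said "pass"
        except Exception:
            continue
    return candidates[0]  # fall back to the first sample
```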

huangd1999 commented 1 month ago

HumanEval: Lines 226-229 are inside the function test_agent_concurrency, which operates only on the completion lists and the generated test lists. So I think the canonical solution does not leak in this function.

MBPP: As shown in https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/programmer_mbpp.py#L38 and in Google's MBPP prompt recommendation (https://github.com/google-research/google-research/tree/master/mbpp):

> You are an expert Python programmer, and here is your task: {prompt} Your code should pass these tests:\n\n{tests}\n[BEGIN]\n{code}\n[DONE]

You can also see Fig. 1 in https://arxiv.org/pdf/2108.07732. MBPP's tests are available to LLMs.
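For reference, a minimal sketch of how the quoted template maps onto an MBPP record (field names `text` and `test_list` are from the MBPP dataset; the exact few-shot framing in google-research's script may differ):

```python
# Build the Google-recommended MBPP prompt from one dataset record.
# Note the ground-truth asserts from `test_list` are shown to the model.

def build_mbpp_prompt(task: dict) -> str:
    tests = "\n".join(task["test_list"])
    return (
        "You are an expert Python programmer, and here is your task: "
        f"{task['text']} "
        "Your code should pass these tests:\n\n"
        f"{tests}\n"
        "[BEGIN]\n"
    )

example = {
    "text": "Write a function to check if the given number is woodball or not.",
    "test_list": ["assert is_woodall(383) == True"],
}
print(build_mbpp_prompt(example))
```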

By the way, we have updated the implementation of AgentCoder; you can pull it to obtain more readable source code.

Qlalq commented 3 weeks ago

That's true, but I don't think the author handled it properly. Only the function name needs to appear in the prompt (e.g., for "Write a function to check if the given number is woodball or not.", the generated content should begin with `def is_woodall(x)`), or an `entry_point` should be set, as HumanEval does. A sketch of this is shown below.
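A minimal sketch of this leak-free alternative, assuming the entry-point name is recovered from the first ground-truth assert (a common trick; the helper names here are hypothetical):

```python
import re

def entry_point_from_test(test: str) -> str:
    """Extract the called function name from an MBPP assert line."""
    match = re.search(r"assert\s+(\w+)\s*\(", test)
    if match is None:
        raise ValueError(f"no function call found in: {test!r}")
    return match.group(1)

def build_leak_free_prompt(task: dict) -> str:
    # Only the task text and the function *name* are exposed; the
    # ground-truth asserts themselves stay hidden from the model.
    name = entry_point_from_test(task["test_list"][0])
    return (
        f"{task['text']}\n"
        f"Your code should start with:\ndef {name}("
    )
```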

Qlalq commented 3 weeks ago

The prompts in https://github.com/noahshinn/reflexion/blob/main/programming_runs/benchmarks/mbpp-py.jsonl use the approach I described.

huangd1999 commented 3 weeks ago

Dear Qlalq,

Thanks for your response and the URL reference. I have checked the MBPP-Py file; it seems very similar to HumanEval.

I am not sure what you mean by "but I don't think the author handled it properly". Do you mean that AgentCoder should use the MBPP-Py file to test MBPP effectiveness (since currently we follow Google's script), or that Google's script should be changed to use the HumanEval prompt template with the MBPP-Py file?

Qlalq commented 3 weeks ago

I think both are fine; the point is not to leak the test set, and giving the function name is sufficient.

huangd1999 commented 3 weeks ago

Oh, thanks for your response. I understand now.