Open xihuai18 opened 1 month ago
HumanEval: Lines 226-229 are inside the function test_agent_concurrency, which is used with the completion lists and the test lists. So I don't think the canonical solution leaks in this function.
MBPP:
As shown in https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/programmer_mbpp.py#L38 and in Google's recommended MBPP prompt (https://github.com/google-research/google-research/tree/master/mbpp), the prompt format is:
You are an expert Python programmer, and here is your task: {prompt} Your code should pass these tests:\n\n{tests}\n[BEGIN]\n{code}\n[DONE]
You can also see Fig. 1 in https://arxiv.org/pdf/2108.07732. In MBPP, the tests are made available to the LLM as part of the prompt.
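For readers following along, here is a minimal sketch of how the template quoted above could be filled in. It assumes each task is a dict with the MBPP fields `text`, `code`, and `test_list`; the helper names `build_few_shot_example` and `build_query` are hypothetical and are not taken from the AgentCoder source.

```python
# Template string as quoted in this thread.
MBPP_TEMPLATE = (
    "You are an expert Python programmer, and here is your task: {prompt} "
    "Your code should pass these tests:\n\n{tests}\n[BEGIN]\n{code}\n[DONE]"
)


def build_few_shot_example(task: dict) -> str:
    """Render a solved task (hypothetical helper): task text, tests, and reference code."""
    return MBPP_TEMPLATE.format(
        prompt=task["text"],
        tests="\n".join(task["test_list"]),
        code=task["code"],
    )


def build_query(task: dict) -> str:
    """Render the task to be solved (hypothetical helper).

    The template is cut right before {code}, so the prompt ends at
    "[BEGIN]\n" and the reference solution is never included.
    """
    prefix = MBPP_TEMPLATE.split("{code}")[0]
    return prefix.format(
        prompt=task["text"],
        tests="\n".join(task["test_list"]),
    )


if __name__ == "__main__":
    # Made-up task for illustration only, not an actual MBPP entry.
    demo = {
        "text": "Write a function to add two numbers.",
        "code": "def add(a, b):\n    return a + b",
        "test_list": ["assert add(1, 2) == 3"],
    }
    print(build_query(demo))
```

Note that in this format the tests are visible to the model, while the reference code only appears in solved few-shot examples.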
By the way, we have updated the AgentCoder implementation; you can pull it to get more readable source code.
That's true, but I don't think the author handled it properly. Only the function name needs to appear in the prompt (e.g., "Write a function to check if the given number is woodball or not. The beginning of the generated content is as follows: def is_woodall(x)"), or an "entry_point" can be set as HumanEval does.
https://github.com/noahshinn/reflexion/blob/main/programming_runs/benchmarks/mbpp-py.jsonl The prompts here use the approach I've described.
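A minimal sketch of the "function name only" idea, for illustration. It assumes the MBPP fields `text`, `code`, and `test_list`, and that the entry point is the top-level function called in the tests. The helper `signature_only_prompt` is hypothetical; it is not taken from AgentCoder or Reflexion.

```python
import ast
from typing import List


def signature_only_prompt(text: str, code: str, test_list: List[str]) -> str:
    """Build a prompt that shows the task text and the function header, but no tests."""
    tree = ast.parse(code)
    funcs = [node for node in tree.body if isinstance(node, ast.FunctionDef)]
    # Assumption: the entry point is the function that the tests call.
    # If none matches, fall back to the last top-level function.
    called = [f for f in funcs if any(f.name + "(" in t for t in test_list)]
    entry = (called or funcs)[-1]
    # Only plain positional parameter names are kept in this sketch.
    params = ", ".join(arg.arg for arg in entry.args.args)
    return (
        f"{text}\n"
        "The beginning of the generated content is as follows:\n"
        f"def {entry.name}({params}):"
    )


if __name__ == "__main__":
    # The function body below is a placeholder, not the MBPP reference solution.
    prompt = signature_only_prompt(
        "Write a function to check if the given number is woodball or not.",
        "def is_woodall(x):\n    return x % 2 == 1\n",
        ["assert is_woodall(383) == True"],
    )
    print(prompt)
```

The tests are used only to locate the entry point; neither the assertions nor the reference body end up in the prompt.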
Dear Qlalq,
Thanks for your response and the URL reference. I have checked the MBPP-Py file, and it seems very similar to HumanEval.
I am not sure what you mean by "but I don't think the author handled it properly". Do you mean that AgentCoder should use the MBPP-Py file to evaluate on MBPP (since we are currently following DeepMind's script), or that DeepMind's script could be changed to use the HumanEval prompt template with the MBPP-Py file?
I think both are fine, since the point is not to leak the test set, and giving the function name is sufficient.
Oh, thanks for your response. I understand now.
It seems that the code uses statistics obtained by running the tests in the dataset. Would that be data / label leakage?
Solution leaking: https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/test_executor_humaneval.py#L226-L229
https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/programmer_humaneval.py#L33
Test result leaking: https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/test_executor_mbpp.py#L144-L149
https://github.com/huangd1999/AgentCoder/blob/c0c8446d8ec3a2a5da7189d3ef8bbadeae17c5ee/src/programmer_mbpp.py#L30