huangd1999 / AgentCoder

This Repo is the official implementation of AgentCoder and AgentCoder+.

Label Leaking? #3

Open xihuai18 opened 1 month ago

xihuai18 commented 1 month ago

It seems that the code uses statistics obtained by running the tests in the dataset. Would that cause data / label leakage?

solutions leaking

test result leaking

huangd1999 commented 1 month ago

HumanEval: Lines 226-229 are part of the function test_agent_concurrency, which operates on the completion lists + test lists. So I think the canonical solution does not leak in this function.
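To make the point about "completion lists + test lists" concrete, here is a minimal sketch of scoring completions using only a list of test strings, with no canonical solution involved. It is not the repository's test_agent_concurrency; the names count_passed and pick_best are hypothetical, and the sketch runs code with exec without any sandboxing or timeouts.

```python
def count_passed(completion, test_list):
    """Count how many test strings pass for one candidate completion.

    Hypothetical helper for illustration: only the completion and the
    tests are touched, never a canonical solution.
    """
    passed = 0
    for test in test_list:
        env = {}
        try:
            exec(completion, env)  # define the candidate function(s)
            exec(test, env)        # run one assert against them
            passed += 1
        except Exception:
            # A failed assert, a syntax error, or a runtime error counts as a failure.
            pass
    return passed


def pick_best(completions, test_list):
    """Return the completion that passes the most tests (first one wins ties)."""
    return max(completions, key=lambda c: count_passed(c, test_list))


if __name__ == "__main__":
    good = "def add(a, b):\n    return a + b\n"
    bad = "def add(a, b):\n    return a - b\n"
    tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]
    print(count_passed(good, tests), count_passed(bad, tests))
```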

MBPP: As shown in Google's MBPP prompt recommendation ("You are an expert Python programmer, and here is your task: {prompt} Your code should pass these tests:\n\n{tests}\n[BEGIN]\n{code}\n[DONE]"), MBPP's tests are available to LLMs. You can also see this in Fig. 1.
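As an illustration of what that template produces, the sketch below fills it for one task. The template text is the one quoted above, cut off after [BEGIN] so the model writes the code itself; the function name build_mbpp_prompt and the argument names are assumptions made for this example.

```python
# Template text as quoted in the comment above, truncated after [BEGIN]
# because the {code} part is what the model is asked to generate.
MBPP_TEMPLATE = (
    "You are an expert Python programmer, and here is your task: {prompt} "
    "Your code should pass these tests:\n\n{tests}\n[BEGIN]\n"
)


def build_mbpp_prompt(text, test_list):
    """Fill the template with a task description and its test strings.

    Hypothetical helper: note that every assert in test_list ends up
    verbatim in the prompt, which is the behaviour under discussion.
    """
    tests = "\n".join(test_list)
    return MBPP_TEMPLATE.format(prompt=text, tests=tests)


if __name__ == "__main__":
    prompt = build_mbpp_prompt(
        "Write a function to check if the given number is woodball or not.",
        ["assert is_woodall(383) == True", "assert is_woodall(254) == False"],
    )
    print(prompt)
```

With this template the model sees both the function name and the expected input/output pairs.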

By the way, we have updated the implementation of AgentCoder. You can pull it to get more readable source code.

Qlalq commented 3 weeks ago

That's true, but I don't think the author handled it properly. Only the function name needs to appear in the prompt (e.g., "Write a function to check if the given number is woodball or not. The beginning of the generated content is as follows: def is_woodall(x)"), or an "entry_point" can be set as HumanEval does.
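A minimal sketch of that alternative, assuming the function name is recovered from the first test string and only that name is placed in the prompt. The helper names and the prompt wording are hypothetical, and the extraction simply takes the first non-builtin function called in the assert.

```python
import ast
import builtins


def entry_point_from_test(test_src):
    """Return the first non-builtin function name called in a test string."""
    tree = ast.parse(test_src)
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and not hasattr(builtins, node.func.id)  # skip set(), len(), ...
        ):
            return node.func.id
    raise ValueError("no function call found in test")


def build_name_only_prompt(text, test_list):
    """Build a prompt that exposes the function name but none of the asserts."""
    name = entry_point_from_test(test_list[0])
    return f"{text} The function name is: {name}"


if __name__ == "__main__":
    print(
        build_name_only_prompt(
            "Write a function to check if the given number is woodball or not.",
            ["assert is_woodall(383) == True"],
        )
    )
```

The expected values in the asserts never reach the model this way, while the generated code can still be called by the tests under the right name.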

Qlalq commented 3 weeks ago

The prompts here use the approach I've described.

huangd1999 commented 3 weeks ago

Dear Qlalq,

Thanks for your response and the URL reference. I have checked the MBPP-Py file, and it seems very similar to HumanEval.

I am not sure what you mean by "but I don't think the author handled it properly". Do you mean that AgentCoder should use the MBPP-Py file to test its effectiveness on MBPP (currently we follow DeepMind's script), or that DeepMind's script could be changed to use the HumanEval prompt template with the MBPP-Py file?

Qlalq commented 3 weeks ago

I think both are fine. The point is not to leak the test set, and giving the function name is sufficient.

huangd1999 commented 3 weeks ago

Oh, thanks for your response. I understand now.