seketeam / EvoCodeBench

An Evolving Code Generation Benchmark Aligned with Real-world Code Repositories
Apache License 2.0

Pass@1 score with oracle code is only 36.36% #9

Open sfc-gh-hhan opened 1 month ago

sfc-gh-hhan commented 1 month ago

Despite following this, the environment setup is still problematic. I ran the Pass@1 evaluation using the oracle code snippets from data.jsonl:

Running pass@1 for local_infilling oracle_greedy
TODO Completions:  275
100%|██████████| 275/275 [13:33<00:00,  2.96s/it]
pass_at_1: 36.36363636363637%
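
For reference, 36.36% of 275 completions corresponds to exactly 100 passing tasks. Below is a minimal sketch of how I assume Pass@1 is computed under greedy decoding (one completion per task, so no unbiased pass@k estimator is needed); the function name is mine, not from the repo:

def pass_at_1(passed: list[bool]) -> float:
    """Pass@1 under greedy decoding: the fraction of tasks whose single
    completion passes all of its test cases."""
    return 100.0 * sum(passed) / len(passed)

# 100 of 275 oracle completions passing reproduces the reported score:
# pass_at_1([True] * 100 + [False] * 175)  ->  36.3636...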

My conjecture is that, because some repositories do not pin their package versions, the environment setup is prone to becoming stale.

The best fix would be to provide a Docker image, as bigcode-evaluation-harness does. I notice that you are working on a Docker image; I wonder when it will be available.

Until then, could you share the specific pip version and update the repo's requirements.txt with pinned package versions via pip freeze?
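
For instance, here is a rough Python equivalent of pip freeze that snapshots the evaluation environment into a pinned requirements file (the helper name is mine, just for illustration):

def freeze_environment(path: str = "requirements.txt") -> None:
    """Write every installed distribution as an exact 'name==version' pin."""
    from importlib.metadata import distributions
    pins = sorted({f"{d.metadata['Name']}=={d.version}" for d in distributions()})
    with open(path, "w") as f:
        f.write("\n".join(pins) + "\n")

if __name__ == "__main__":
    freeze_environment()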

pldlgb commented 1 month ago

Does this mean that the results in the paper are also not credible? If a large fraction of tasks cannot pass regardless of the generated code, a model will fail even when its output is correct, which leads to unfair comparisons.

sfc-gh-hhan commented 1 month ago

Since the GPT-4 Pass@1 score rises above the reported one (from 20.73 to 26.54) after fixing some environment issues, I suspect the environment used for the results reported in the paper is also problematic.

LJ2lijia commented 1 month ago

Thank you for your suggestions. We would like to clarify three points:

(1) First, in our experimental environment we verified that all reference code passes all test cases. This shows that our execution environment provides the packages required by the reference code, so we believe our experimental results are credible.

(2) Recently, we found that code generated by GPT-4 and other models may rely on additional third-party libraries. These libraries are not listed in requirements.txt and currently have to be installed manually (one possible way to detect such missing imports is sketched after this list). We will address this problem in future work.

(3) Your suggestion is great, @sfc-gh-hhan. We will release Dockerfiles to facilitate the evaluation. Thank you.
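
Regarding (2), a hypothetical helper for flagging which top-level imports in a generated completion are not installed in the evaluation environment (not part of EvoCodeBench, just one possible approach):

import ast
from importlib.util import find_spec

def missing_imports(generated_code: str) -> set[str]:
    """Return top-level module names imported by the code but not installed."""
    modules = set()
    for node in ast.walk(ast.parse(generated_code)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return {m for m in modules if find_spec(m) is None}

# Example: missing_imports("import torch\nfrom flask import Flask") reports
# whichever of 'torch' or 'flask' is absent from the current environment.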

sfc-gh-hhan commented 1 month ago

Thank you for your response:

(1, 3) It's a relief that all the reference code passed all the test cases. I am really looking forward to running my experiments on your benchmark. May I ask for a rough timeline regarding the Docker release?

(2) That is an interesting point. Should we regard such generations as incorrect even though the benchmark does not provide an explicit list of installed packages? I'm happy to hear your thoughts.