microsoft / CodeT

Issue reproducing Codex numbers #7

Closed martiansideofthemoon closed 5 months ago

martiansideofthemoon commented 1 year ago

Dear authors, thank you for releasing the data accompanying CodeT. However, I am having trouble reproducing the Codex numbers reported in the paper. The paper reports 47.0 pass@1, 74.9 pass@10, and 92.1 pass@100 for Codex; I tried to compute these scores from the data provided in the repository.

First, I converted the provided generations into the format expected by the official OpenAI HumanEval evaluation script:

import argparse
import json
from human_eval.data import read_problems

parser = argparse.ArgumentParser()
parser.add_argument('--results_file', default="codet-generations/HumanEval_davinci002_temp0.8_topp0.95_num100_max300_code_solution.jsonl")
parser.add_argument('--outputs_file', default="codet-generations/HumanEval_davinci002_temp0.8_topp0.95_num100_max300_code_solution_outputs.jsonl")
args = parser.parse_args()

with open(args.results_file, "r") as f:
    results = [json.loads(line) for line in f.read().strip().split("\n")]

problems = read_problems()
problem_ids_prompts = [(x, problems[x]["prompt"]) for x in problems]

outputs = []

for res in results:
    # Recover the HumanEval task_id by matching the first "def ..." line
    # (plus the line after it) of the CodeT prompt against the official prompts.
    task_ids = []
    prompt_lines = res["prompt"].split("\n")
    def_idx = [i for i, x in enumerate(prompt_lines) if x.startswith("def ")][0]
    def_sig = "\n".join(prompt_lines[def_idx:def_idx + 2])
    for task_id1, prompt1 in problem_ids_prompts:
        if def_sig in prompt1:
            task_ids.append(task_id1)
    assert len(task_ids) == 1  # the two-line signature should match exactly one problem
    # Emit one JSONL record per sample, in the format expected by human-eval.
    for sample in res["samples"]:
        outputs.append(json.dumps({
            "task_id": task_ids[0],
            "completion": sample
        }))

with open(args.outputs_file, "w") as f:
    f.write("\n".join(outputs) + "\n")
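
For completeness, the evaluation step itself is just the stock scorer run over the converted file. A minimal sketch, assuming the evaluate_functional_correctness helper in human_eval.evaluation from the openai/human-eval package:

from human_eval.evaluation import evaluate_functional_correctness

# Scores the JSONL file of {"task_id", "completion"} records written above.
scores = evaluate_functional_correctness(
    "codet-generations/HumanEval_davinci002_temp0.8_topp0.95_num100_max300_code_solution_outputs.jsonl",
    k=[1, 10, 100],
)
print(scores)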

After running this through the official OpenAI script, I get {'pass@1': 0.1653, 'pass@10': 0.6257, 'pass@100': 0.8537}. What could I be doing incorrectly here? Also, is there a reason to choose n=100 instead of n=200 as in Chen et al. 2021 and the CodeGen paper? I'm guessing the pass@100 estimate would be more accurate with n=200.
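
(For context on that last question: my understanding is that pass@k is computed with the unbiased estimator from Chen et al. 2021, sketched below. With n = k = 100 the per-problem estimate collapses to 1 if any sample passes and 0 otherwise, so n = 200 should give a noticeably tighter pass@100 estimate.)

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of pass@k for one problem with n samples, c of which pass:
    # 1 - C(n - c, k) / C(n, k), computed as a numerically stable product.
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))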

Thank you! Kalpesh

martiansideofthemoon commented 1 year ago

Quick update on this. I independently generated completions from code-davinci-002 with the same hyperparameters (top-p=0.95, temperature=0.8, n=200). The results I got are better than those from the data released in this repository and similar to the numbers in the paper.

{'pass@1': 0.20344512195121955, 'pass@10': 0.6839959515336559, 'pass@100': 0.91049547719406}
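
For reference, generation was just the plain completions endpoint with those hyperparameters. A rough sketch, assuming the pre-1.0 openai Python client (n may need to be split across several requests depending on API limits):

import openai
from human_eval.data import read_problems

problems = read_problems()
prompt = problems["HumanEval/0"]["prompt"]  # one task's prompt, as an example

response = openai.Completion.create(
    model="code-davinci-002",
    prompt=prompt,
    max_tokens=300,
    temperature=0.8,
    top_p=0.95,
    n=200,
)
samples = [choice["text"] for choice in response["choices"]]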

If I use a simple heuristic that truncates trailing incomplete lines and lines ending with a colon (sketched after the numbers below), this rises to:

{'pass@1': 0.23917682926829273, 'pass@10': 0.7161137949932483, 'pass@100': 0.9262563823165132}
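
The heuristic is roughly the following (a sketch, not the exact code I ran): drop trailing lines of the completion that are blank or end with a colon, i.e. block headers whose bodies were cut off by the token limit.

def truncate_incomplete_lines(completion: str) -> str:
    # Drop trailing lines that look incomplete: blank lines and lines ending
    # with a colon (a block header whose body never got generated).
    lines = completion.rstrip().split("\n")
    while lines and (not lines[-1].strip() or lines[-1].rstrip().endswith(":")):
        lines.pop()
    return "\n".join(lines)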

Finally, truncating the code until it parses without errors under ast.parse (sketch below) improves performance further:

{'pass@1': 0.3730487804878049, 'pass@10': 0.793568620027091, 'pass@100': 0.9432289234601097}
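
The ast.parse-based truncation is essentially this (again a sketch rather than the exact code): keep dropping trailing lines of the completion until the prompt plus the completion parses as valid Python.

import ast

def truncate_until_parses(prompt: str, completion: str) -> str:
    # Remove trailing lines from the completion until prompt + completion is
    # syntactically valid; return the longest parsable prefix (possibly empty).
    lines = completion.split("\n")
    while lines:
        candidate = "\n".join(lines)
        try:
            ast.parse(prompt + candidate)
            return candidate
        except SyntaxError:
            lines.pop()
    return ""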