Closed shivamag125 closed 5 months ago
I used the instruct model and only got {'pass@1': 0.05227492739593417}
:(
But when I use the raw <PRE>, <SUF>, <MID> tokens instead, in my tests it works better than <FILL_ME>.
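A hypothetical sketch of how the raw infilling tokens could be assembled into a prompt string. The exact whitespace handling is tokenizer-dependent and the format string here is an assumption for illustration, not the official Code Llama specification:

```python
# Hypothetical sketch: assembling an infilling prompt from the raw control
# tokens in prefix-suffix-middle (PSM) order. The model is then asked to
# generate the "middle" after the <MID> marker. Spacing is an assumption.

def build_psm_prompt(prefix: str, suffix: str) -> str:
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"

prompt = build_psm_prompt("def add(a, b):\n    return ", "\n")
print(prompt)
```

In practice the real control tokens are special tokenizer IDs, so building the prompt at the token level (rather than via string concatenation) is safer.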
Same issue here. I cannot reproduce the infilling results reported in the paper; my numbers are a bit lower. Any ideas?
Dear @shivamag125 , @timxx and @stgzr, thanks for reporting!
@timxx : The instruction models are not intended to be used for infilling, please use the pretrained models.
@shivamag125 and @stgzr : The hyperparameters (greedy decoding, i.e. temperature=0) are reported in the paper (Table 14). Note that you need to compare to the models with LCFT in the table, since pretrained models without LCFT have not been released. Moreover, a frequent problem for infilling models is knowing where to stop. Our code cuts the generation after the first linebreak in the single-line infilling task.
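The single-line stopping rule described above can be sketched in a few lines (a minimal illustration, not the authors' actual evaluation code):

```python
# Minimal sketch of the single-line stopping heuristic: keep only the text
# up to (but not including) the first linebreak in the generated completion.

def truncate_single_line(generation: str) -> str:
    return generation.split("\n", 1)[0]

print(truncate_single_line("return a + b\nprint('extra')"))
```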
Thank you for the detailed reply. I will check my implementation. Another question: when should generation stop in the multi-line and random-span tasks, using \<EOT>?
Thanks! Using a stopping condition like \n reproduces the numbers.
For multi-line, there are other stopping heuristics (see TruncationParameters in https://github.com/Eric-Wallace/codex/blob/main/infill_evaluation.py), but IIRC both https://github.com/bigcode-project/bigcode-evaluation-harness and our internal code use only EOT as the stop symbol for multi-line.
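An EOT-based stopping rule for the multi-line and random-span tasks can be sketched as below. The literal marker string "<EOT>" is an assumption for illustration; in the real pipeline EOT is a special token ID checked during decoding:

```python
# Sketch of an EOT stopping rule: discard everything from the first
# occurrence of the end-of-infilling marker onward. The marker string
# "<EOT>" is a placeholder for the model's actual special token.

def truncate_at_eot(generation: str, eot: str = "<EOT>") -> str:
    idx = generation.find(eot)
    return generation if idx == -1 else generation[:idx]
```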
Hello,
I am trying to reproduce the infilling results on HumanEval (Table 14, CodeLLAMA 7B SPM, pass@1=83%). I am using the single-line benchmark from https://github.com/openai/human-eval-infilling. I use the code below to generate the samples.
Next I run the following to compute pass@1. I obtain pass@1 = 0.73281, which is much lower than the reported result.
evaluate_infilling_functional_correctness samples_base_pretrained_codellama.jsonl --benchmark_name=single-line
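For reference, pass@k is usually computed with the unbiased estimator from the original HumanEval paper; with greedy decoding (a single sample per task, n=1) pass@1 reduces to the fraction of tasks solved. A sketch:

```python
# Unbiased pass@k estimator (HumanEval paper): probability that at least
# one of k samples drawn from n total is correct, given c correct samples.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        # Every size-k subset contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With greedy decoding, pass_at_k(1, c, 1) is simply 1.0 if the single sample passes and 0.0 otherwise, averaged over tasks.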
Can you please help with the following: