salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Unexpected outputs using CodeT5+ 2B / 6B #97

Closed Mobius-Ash closed 1 year ago

Mobius-Ash commented 1 year ago

Hello,

Thanks for your work and the public models! However, when I tried the example code provided in your repository with CodeT5+ 2B / 6B, I did not get the expected output. The output seems incorrect and is sometimes unrelated to my inputs.

Could you please help me identify what I might be doing wrong? What changes should I make to the input to get the correct output?

Environments:

I have not tested CodeT5+ 16B because I can already get good results from StarCoder 15B, so I expect the CodeT5+ 16B model can also perform well.

My input is:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

checkpoint = "Salesforce/codet5p-6b"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint,
                                              torch_dtype=torch.float16,
                                              trust_remote_code=True,
                                              revision="main"
                                             ).to(device)
inputs = tokenizer.encode("def binary_search(nums, target):\n    ", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=128, do_sample=True, temperature=0.9, top_p=0.8, top_k=0)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

and its output is:

"  # type: str
    size = ""
    for i in range(len(nums)):
        size += str(nums[i]) + ","
    size = size[0:-1]
    size = '"' + size + '"'
    print(nums)
    print(size)
    print(target)
    return nums, size, target

def create_nums(start, end, step):
    nums = []
    for i in range(start, end, step):
        nums.append(i

Thank you for your help!

yuewang-cuhk commented 1 year ago

Hi, thanks for your interest in our work! These larger CodeT5+ models require you to pass the prompt to the decoder via decoder_input_ids to achieve better generation performance, because their decoders are initialized from frozen off-the-shelf LLMs and need that additional context on the decoder side. Please follow the example below for code generation:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

checkpoint = "Salesforce/instructcodet5p-16b"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint,
                                              torch_dtype=torch.float16,
                                              low_cpu_mem_usage=True,
                                              trust_remote_code=True).to(device)

encoding = tokenizer("def print_hello_world():", return_tensors="pt").to(device)
encoding['decoder_input_ids'] = encoding['input_ids'].clone()
outputs = model.generate(**encoding, max_length=15)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
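
For reference, the same fix applied to the codet5p-6b setup from the original report would look like the sketch below. Only the checkpoint and prompt are taken from the issue above; everything else mirrors the snippet just shown, and I have not verified this exact checkpoint/prompt combination:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

checkpoint = "Salesforce/codet5p-6b"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint,
                                              torch_dtype=torch.float16,
                                              low_cpu_mem_usage=True,
                                              trust_remote_code=True).to(device)

# Feed the same prompt to both the encoder and the decoder: the frozen
# decoder needs the prompt as its starting context via decoder_input_ids.
encoding = tokenizer("def binary_search(nums, target):\n    ", return_tensors="pt").to(device)
encoding['decoder_input_ids'] = encoding['input_ids'].clone()
outputs = model.generate(**encoding, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
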
Mobius-Ash commented 1 year ago

Awesome! It works.