salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

CodeT5+ | Repeated <extra_id_1> in the generated tokens #105


fillassuncao commented 1 year ago

Given the code below:

from transformers import T5ForConditionalGeneration, AutoTokenizer

checkpoint = "Salesforce/codet5p-220m"
device = "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)

# <extra_id_0> is a sentinel token marking the span the model should infill
inputs = tokenizer.encode('def print_hello_world(): <extra_id_0>"', return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

The output I get is:

>>> print(tokenizer.decode(outputs[0], skip_special_tokens=False))
<pad><extra_id_0>
    print "Hello World<extra_id_1>
    print "Hello World<extra_id_1>
    print "Hello World<extra_id_1>
    print "Hello World<extra_id_1>
    print "Hello World<extra_id_1>
    print "Hello World<extra_id_1>
    print "Hello World<extra_id_1>
    print "Hello World<extra_id_1>
    print

I was not expecting to get multiple <extra_id_1> tokens. Is this known or expected?

yuewang-cuhk commented 1 year ago

Hi there, <extra_id_1> is also a special (sentinel) token, and the output here is indeed somewhat unexpected. We would suggest using the codet5p-220m and codet5p-770m models in a finetuning setting.

For the zero-shot setting, some truncation strategy should be applied to obtain the desired output. This is because the model did not see this exact input-output pair during pretraining, so it is difficult for it to learn when to stop generating. For instance, in HumanEval evaluation it is common practice to truncate the generation at certain stop tokens to produce a clean output.
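
As an illustration, a minimal sketch of such stop-token truncation could look like the following. The stop-token list here is an illustrative assumption, not the exact set used by any official harness, and the snippet reuses tokenizer and outputs from the code above:

# Illustrative stop tokens: the first sentinel, plus common Python block starts.
STOP_TOKENS = ["<extra_id_1>", "\ndef ", "\nclass ", "\n#"]

def truncate_at_stop_tokens(text, stop_tokens=STOP_TOKENS):
    # Cut the text at the earliest occurrence of any stop token.
    cut = len(text)
    for tok in stop_tokens:
        idx = text.find(tok)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

raw = tokenizer.decode(outputs[0], skip_special_tokens=False)
# Drop everything up to and including the <extra_id_0> sentinel, then truncate.
completion = raw.split("<extra_id_0>", 1)[-1]
print(truncate_at_stop_tokens(completion))

For the output reported above, this would keep only the first print "Hello World completion and discard the repeated <extra_id_1> segments.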