Open NTDXYG opened 1 year ago
What was your hardware spec? Did you try the fine-tuning script in the repo?
I have the same problem. I froze the decoder and only trained the encoder. The 220M model can reach 0.8+ performance, while the 2B model only reaches 0.2+. Does the "shallow encoder, deep decoder" architecture have some problem with fine-tuning?
By the way, I use model.decoder.requires_grad_(False) to freeze the decoder, and I also use transformers.Seq2SeqTrainer.
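Roughly, my training setup looks like this (a minimal sketch; the checkpoint name, dataset variables, and hyperparameters are only placeholders):

from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

# Example checkpoint; I swap between the 220M and 2B variants
# (the 2B one also needs trust_remote_code=True).
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5p-220m")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5p-220m")

# Freeze the decoder so only the encoder is trained.
model.decoder.requires_grad_(False)

args = Seq2SeqTrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    num_train_epochs=10,
    learning_rate=5e-5,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: my tokenized training set
    eval_dataset=eval_dataset,    # placeholder: my tokenized dev set
    tokenizer=tokenizer,
)
trainer.train()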
Hi both, this might be due to the inference difference between the CodeT5+ 220M/770M and 2B/6B/16B models. The former are pretrained from scratch, while the latter use a frozen GPT-style LLM as the deep decoder. So for inference with CodeT5+ >=2B models, we suggest also feeding some prefix prompts to the decoder to provide more context to the model, as we did in the HumanEval evaluation here. This helps maintain better compatibility with the default behavior of GPT models. Note that CodeT5+ 220M/770M do not need such additional prefix prompts, as they are pretrained from scratch.
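Concretely, the decoder prefix trick looks roughly like this (a minimal sketch; the prompt and generation arguments are only illustrative):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint = "Salesforce/codet5p-2b"

# The 2B/6B/16B checkpoints use a custom model class, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, trust_remote_code=True
).to(device)

prompt = "def hello_world():"  # illustrative prompt
encoding = tokenizer(prompt, return_tensors="pt").to(device)

# Feed the same prompt to the decoder as a prefix, so the frozen GPT-style
# decoder gets the kind of context it expects as a causal LM.
encoding["decoder_input_ids"] = encoding["input_ids"].clone()

outputs = model.generate(**encoding, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))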
It's useful! Thanks for your reply.
I also have a point of confusion regarding fine-tuning the CodeT5+ 2B model. When fine-tuning the 220M and 770M models, this is my code:
...
# Each batch yields the tokenized source (input) and target (output) sequences.
source_ids, source_mask, target_ids, target_mask = batch
outputs = model(input_ids=source_ids, attention_mask=source_mask,
                labels=target_ids, decoder_attention_mask=target_mask)
loss = outputs.loss
...
Here, source_ids is just my input and target_ids is just my output.
I'm not sure if something needs to be changed when fine-tuning the 2B model. For example, does target_ids need to be changed to my input + output? I ask because I see that an additional decoder_input_ids parameter is added in the generation process.
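To make the question concrete, this is the kind of change I'm wondering about (just my guess, not verified; the texts and the label masking are only illustrative):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5p-2b", trust_remote_code=True)

source_text = "# write a function that adds two numbers"  # illustrative input
target_text = "def add(a, b):\n    return a + b"          # illustrative output

src = tokenizer(source_text, add_special_tokens=False).input_ids
tgt = tokenizer(target_text, add_special_tokens=False).input_ids

# Guess: build target_ids as input + output, and mask the input part of the
# labels with -100 so the loss is only computed on the output tokens?
target_ids = torch.tensor([src + tgt + [tokenizer.eos_token_id]])
labels = target_ids.clone()
labels[:, :len(src)] = -100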
I look forward to hearing from you, thank you!
I tried to fine-tune code generation on specific domains using LoRA for the 220M, 770M, 2B, and 6B models. I kept the hyperparameters consistent: for 220M and 770M I left target_modules at the default, for 2B and 6B I set target_modules to ['q_proj', 'v_proj'], and the rest of the parameters were r=8, lora_alpha=32, lora_dropout=0.1. I was surprised to find that on the BLEU metric, 770M > 220M > 2B > 6B.
Is anyone else experiencing the same confusion? Intuitively, the larger the model, the better its performance on downstream tasks should be.
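For reference, my LoRA setup looks roughly like this with peft (a sketch of how I configured the 2B/6B variants; the training loop is omitted and the checkpoint name is just an example):

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

# Config for 2B/6B; for 220M/770M I keep peft's default target_modules.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)

model = AutoModelForSeq2SeqLM.from_pretrained(
    "Salesforce/codet5p-2b", trust_remote_code=True
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()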