Open NTDXYG opened 1 year ago
What was your hardware spec? Did you try the fine-tuning script in the repo?
I have the same problem. I froze the decoder and only trained the encoder. The 220M model can reach 0.8+ performance, while the 2B model only reaches 0.2+. Does the "shallow encoder, deep decoder" architecture have some problem with fine-tuning?
By the way, I use model.decoder.requires_grad_(False) to freeze the decoder, and I also use transformers.Seq2SeqTrainer.
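Roughly, my training setup looks like this (a minimal sketch; the checkpoint name, dataset variables, and hyperparameters are only placeholders):

from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

# Example checkpoint; I swap between the 220M and 2B variants
# (the 2B one also needs trust_remote_code=True).
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5p-220m")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5p-220m")

# Freeze the decoder so only the encoder is trained.
model.decoder.requires_grad_(False)

args = Seq2SeqTrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    num_train_epochs=10,
    learning_rate=5e-5,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: my tokenized training set
    eval_dataset=eval_dataset,    # placeholder: my tokenized dev set
    tokenizer=tokenizer,
)
trainer.train()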
Hi both, this might be due to the inference difference between the CodeT5+ 220M/770M and 2B/6B/16B models. The former are pretrained from scratch, while the latter use a frozen GPT-style LLM as the deep decoder. So for inference with CodeT5+ >=2B models, we suggest also feeding some prefix prompts to the decoder to provide more context to the model, as we did in the HumanEval evaluation here. This helps maintain better compatibility with the default behavior of GPT models. Note that CodeT5+ 220M/770M do not need such additional prefix prompts, as they are pretrained from scratch.
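Concretely, the decoder prefix trick looks roughly like this (a minimal sketch; the prompt and generation arguments are only illustrative):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint = "Salesforce/codet5p-2b"

# The 2B/6B/16B checkpoints use a custom model class, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, trust_remote_code=True
).to(device)

prompt = "def hello_world():"  # illustrative prompt
encoding = tokenizer(prompt, return_tensors="pt").to(device)

# Feed the same prompt to the decoder as a prefix, so the frozen GPT-style
# decoder gets the kind of context it expects as a causal LM.
encoding["decoder_input_ids"] = encoding["input_ids"].clone()

outputs = model.generate(**encoding, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))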
It's useful! Thanks for your reply.
I also have a point of confusion regarding fine-tuning the CodeT5+ 2B model. When fine-tuning the 220M and 770M models, this is my code:
...
# Each batch yields the tokenized source (input) and target (output) sequences.
source_ids, source_mask, target_ids, target_mask = batch
outputs = model(input_ids=source_ids, attention_mask=source_mask,
                labels=target_ids, decoder_attention_mask=target_mask)
loss = outputs.loss
...
Here, source_ids is just my input and target_ids is just my output.
I'm not sure if something needs to be changed when fine-tuning the 2B model. For example, does target_ids need to be changed to my input + output? I ask because I see that an additional decoder_input_ids parameter is added in the generation process.
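To make the question concrete, this is the kind of change I'm wondering about (just my guess, not verified; the texts and the label masking are only illustrative):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5p-2b", trust_remote_code=True)

source_text = "# write a function that adds two numbers"  # illustrative input
target_text = "def add(a, b):\n    return a + b"          # illustrative output

src = tokenizer(source_text, add_special_tokens=False).input_ids
tgt = tokenizer(target_text, add_special_tokens=False).input_ids

# Guess: build target_ids as input + output, and mask the input part of the
# labels with -100 so the loss is only computed on the output tokens?
target_ids = torch.tensor([src + tgt + [tokenizer.eos_token_id]])
labels = target_ids.clone()
labels[:, :len(src)] = -100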
I look forward to hearing from you, thank you!
I tried to fine-tune code generation on specific domains using LoRA for the 220M, 770M, 2B, and 6B models. I kept the hyperparameters consistent: for 220M and 770M I left target_modules at the default, for 2B and 6B I set target_modules to ['q_proj', 'v_proj'], and the rest of the parameters were r=8, lora_alpha=32, lora_dropout=0.1. I was surprised to find that on the BLEU metric, 770M > 220M > 2B > 6B.
Is anyone else experiencing the same confusion? Intuitively, the larger the model, the better its performance on downstream tasks should be.
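For reference, my LoRA setup looks roughly like this with peft (a sketch of how I configured the 2B/6B variants; the training loop is omitted and the checkpoint name is just an example):

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

# Config for 2B/6B; for 220M/770M I keep peft's default target_modules.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)

model = AutoModelForSeq2SeqLM.from_pretrained(
    "Salesforce/codet5p-2b", trust_remote_code=True
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()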