salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Fine-tuning CodeT5+ 2B #112

Open antonio-mastropaolo opened 1 year ago

antonio-mastropaolo commented 1 year ago

Hello everyone!

Has anyone already come up with a script for fine-tuning CodeT5+ (2B, 6B) on their own seq2seq task?

beneyal commented 1 year ago

I have no idea if this is correct or not, but I tried my hand at fine-tuning CodeT5+ 2B using this code:

from peft import get_peft_config, get_peft_model, LoraConfig, PeftConfig, PeftModel, PrefixTuningConfig, TaskType
from transformers import AutoConfig, AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

model_id = "Salesforce/codet5p-2b"

# Took the config from here: https://www.philschmid.de/fine-tune-flan-t5-peft;
# that post targets the "q" and "v" submodules, so I figured I'd target "qkv_proj" here, but this might be wrong
peft_config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=16, target_modules=["qkv_proj"], lora_alpha=32, lora_dropout=0.1)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Without these settings, the model won't work
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True, revision="main")
config.decoder_start_token_id = tokenizer.bos_token_id
config.pad_token_id = tokenizer.pad_token_id

model = AutoModelForSeq2SeqLM.from_pretrained(model_id, config=config, trust_remote_code=True, revision="main")
model = get_peft_model(model, peft_config)

# Rest of the code is similar to the blog post above

label_pad_token_id = -100

data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

repository_id = f"{model_id.split('/')[1]}-lora"

training_args = Seq2SeqTrainingArguments(
    output_dir=repository_id,
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=3,
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    seed=1,
    run_name="codet5p-2b-lora",
    report_to="wandb",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    tokenizer=tokenizer,
)

trainer.train()
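
For reference, the tokenized_dataset used above comes from the part of the code that is "similar to the blog post". A minimal sketch of how it could be built, assuming a Hugging Face dataset with "input" and "target" text columns (the column names, file name, and max lengths are hypothetical, not from the original post):

from datasets import load_dataset

# Hypothetical CSV with "input" (code) and "target" (e.g. summary) text columns
raw_dataset = load_dataset("csv", data_files={"train": "train.csv"})

def preprocess(examples):
    model_inputs = tokenizer(examples["input"], max_length=512, truncation=True)
    # Tokenize the targets in target mode so the labels use the right special tokens
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["target"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = raw_dataset.map(preprocess, batched=True, remove_columns=raw_dataset["train"].column_names)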

I used one machine with an NVIDIA H100 80GB; training took about an hour and a half.

My results (screenshot omitted from this text capture):

Hope this helps and that your training is better than mine 😅

Good luck!

antonio-mastropaolo commented 1 year ago

@beneyal Thank you very much for your answer. Actually, I did fine-tune CodeT5+ 2B, and my code looks very similar to yours. So far, the predictions I obtained make sense, and I haven't noticed slow inference times. Below I post my code; perhaps it might be useful to you or others.

from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer, AutoTokenizer, AutoConfig
from peft import get_peft_config, get_peft_model, get_peft_model_state_dict, LoraConfig, TaskType
import torch
import os
import pandas as pd
from datasets import Dataset, load_dataset, load_metric
import datasets
os.environ["TOKENIZERS_PARALLELISM"] = "false"
from torch.utils.data import DataLoader
from transformers import default_data_collator, get_linear_schedule_with_warmup
from tqdm import tqdm
import numpy as np

device = "cuda"
model_name_or_path = "Salesforce/codet5p-2b"
tokenizer_name_or_path = "Salesforce/codet5p-2b"
metric = load_metric("sacrebleu")

def preprocess_function(examples):
    inputs = [method for method in examples["preparedInput"]]
    targets = [comment for comment in examples["preparedComment"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

def compute_metrics(eval_preds):

    def postprocess_text(preds, labels):
        preds = [pred.strip() for pred in preds]
        labels = [[label.strip()] for label in labels]
        return preds, labels

    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    print(decoded_preds)
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

config = AutoConfig.from_pretrained("Salesforce/codet5p-2b",trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

def main():
    # creating model
    peft_config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1, target_modules=["q_proj","v_proj"])

    config.eos_token_id=2
    config.bos_token_id=1
    config.pad_token_id=0
    config.decoder_start_token_id=tokenizer.convert_tokens_to_ids(['<pad>'])[0]

    model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path,config=config, trust_remote_code=True)
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()  # prints the number of trainable LoRA parameters

    train_data = pd.read_csv('dataset/train-codeSummary.csv')
    val_data = pd.read_csv('dataset/eval-codeSummary.csv')

    train_data = train_data[['preparedInput','preparedComment']]
    val_data = val_data[['preparedInput','preparedComment']]

    train_dataset = Dataset.from_pandas(train_data)
    eval_dataset = Dataset.from_pandas(val_data)
    raw_datasets = datasets.DatasetDict({"train":train_dataset,"eval":eval_dataset})
    tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

    batch_size = 1
    args = Seq2SeqTrainingArguments(
          "./models/codet5-plus-peft/",
          evaluation_strategy = "epoch",
          learning_rate=5e-5,
          per_device_train_batch_size=batch_size,
          per_device_eval_batch_size=batch_size,
          #weight_decay=0.1,
          save_strategy='epoch',
          num_train_epochs=30,
          predict_with_generate=True    
    )

    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

    trainer = Seq2SeqTrainer(
        model,
        args,
        train_dataset=tokenized_datasets['train'],
        eval_dataset=tokenized_datasets['eval'],
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )

    trainer.train()

if __name__ == '__main__':
    main()
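
As a usage note for the script above: after training, the saved LoRA adapter can be re-attached to the base model for inference. A minimal sketch, assuming the adapter was saved under the output_dir above (the checkpoint folder name is hypothetical):

from peft import PeftModel
from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer

base_id = "Salesforce/codet5p-2b"
adapter_path = "./models/codet5-plus-peft/checkpoint-1000"  # hypothetical checkpoint folder

tokenizer = AutoTokenizer.from_pretrained(base_id)
config = AutoConfig.from_pretrained(base_id, trust_remote_code=True)
# Apply the same token-id settings used before training (eos/bos/pad/decoder_start)
config.eos_token_id = 2
config.bos_token_id = 1
config.pad_token_id = 0
config.decoder_start_token_id = tokenizer.convert_tokens_to_ids(['<pad>'])[0]

base_model = AutoModelForSeq2SeqLM.from_pretrained(base_id, config=config, trust_remote_code=True)
# Wrap the frozen base model with the trained LoRA weights
model = PeftModel.from_pretrained(base_model, adapter_path).to("cuda").eval()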
yuewang-cuhk commented 1 year ago

Hi there, we have provided an example fine-tuning script; please see here for more details. For bigger models such as 2B and 6B, please use DeepSpeed for training acceleration.
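
For reference, one common way to enable DeepSpeed with the Hugging Face Seq2SeqTrainer is to pass a ZeRO config directly through the training arguments. This is only a generic sketch, not the configuration used by the CodeT5+ example script; the ZeRO stage, offloading, and batch sizes are placeholder choices:

from transformers import Seq2SeqTrainingArguments

# Minimal ZeRO-2 config; "auto" values are filled in from the TrainingArguments
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}

training_args = Seq2SeqTrainingArguments(
    output_dir="./codet5p-2b-ft",   # hypothetical output directory
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed=ds_config,            # or a path to a ds_config.json file
)

Multi-GPU runs are then launched with the deepspeed (or torchrun) launcher rather than plain python.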

avisoori-databricks commented 1 year ago

@yuewang-cuhk Thanks! Could you please share any specific guidelines on doing this with PEFT/LoRA? In particular, the LoRA config, e.g. target_modules=["q_proj","v_proj"] or something else?
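
While waiting for official guidance, one generic way to find candidate target_modules is to list the linear submodules of the loaded model; LoRA's target_modules matches against these names. This is a general PEFT inspection trick, not a recommendation from the maintainers:

import torch
from transformers import AutoConfig, AutoModelForSeq2SeqLM

model_id = "Salesforce/codet5p-2b"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, config=config, trust_remote_code=True)

# Unique leaf names of nn.Linear layers (e.g. "qkv_proj", "q_proj", "v_proj"
# as mentioned earlier in this thread)
linear_names = {name.split(".")[-1]
                for name, module in model.named_modules()
                if isinstance(module, torch.nn.Linear)}
print(sorted(linear_names))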

antonio-mastropaolo commented 1 year ago

Hi there, we have provided an example fine-tuning script; please see here for more details. For bigger models such as 2B and 6B, please use DeepSpeed for training acceleration.

@yuewang-cuhk Many thanks for providing this script.

Is there a way to make the model stop when generating predictions instead of filling all max_length tokens (or so)? Once fine-tuned with the provided script, the model always outputs very long predictions up to the max_length parameter. Here are some examples:

Mark this for the real keymap, as it doesn't allow us to  easily resume partial downloads. If it doesn't allow us to  easily resume partial downloads. If it's null, then this should not  also add this ability to the real key. This should not be called.  

You shouldn't care about it, so it's a partial downloads. If it's null, then this will  guarantee  as in order to add the header to the fragment.  You should be done, at the bottom of the list, at the bottom of the list.  You shouldn't care about it
2.get the int value and compare it to result.  If it is a P2 result, then it is a P2 result and the cache is a P2 result.  

The inference function is the following:

def run_inference(tokenizer, model, dataset):
    model.eval()
    print(f"Starting Inference")
    for idx,row in dataset.iterrows():
        inputElement = row['preparedInput']
        encoding = tokenizer(inputElement, return_tensors="pt").to('cuda:0')
        #encoding['decoder_input_ids'] = encoding['input_ids'].clone()
        #outputs = model.generate(**encoding, max_length=128, eos_token_id=tokenizer.eos_token_id)
        outputs = model.generate(**encoding,
                                    do_sample=True,
                                    temperature=0.8,
                                    max_length=128,
                                    eos_token_id=tokenizer.eos_token_id,
                                    top_p=0.95)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Thanks in advance for any help you can provide on this!
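
One possible cause worth checking for the never-stopping generations above (this is only a guess, not something confirmed in the thread): if the tokenized labels never end with the EOS token, the model has no signal to learn when to stop and will run until max_length. A sketch of appending it explicitly in the preprocessing step, reusing the column names from the script above:

def preprocess_function(examples):
    inputs = [method for method in examples["preparedInput"]]
    targets = [comment for comment in examples["preparedComment"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    with tokenizer.as_target_tokenizer():
        # 127 leaves room for the EOS token appended below
        labels = tokenizer(targets, max_length=127, truncation=True)
    # Ensure every label sequence ends with EOS so the model learns to stop
    model_inputs["labels"] = [ids + [tokenizer.eos_token_id] for ids in labels["input_ids"]]
    return model_inputs

The id appended here also has to match the eos_token_id set on the model config and passed to generate(); a mismatch would produce the same symptom.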

zhuxunyu commented 11 months ago

Hi there, we have provided an example fine-tuning script; please see here for more details. For bigger models such as 2B and 6B, please use DeepSpeed for training acceleration.

The fine-tuning script works when the model is 220M or 770M; however, for bigger models such as 2B and 6B, the script doesn't work. I hope you can look into it.