microsoft / BioGPT


Issue with Degrading Text Generation Quality in Fine-Tuning BioGPT #118

Open stebliankin opened 9 months ago

stebliankin commented 9 months ago

Good afternoon! Thank you for open-sourcing such fantastic work! I have been trying to fine-tune BioGPT on a subset of textual data to give it more knowledge of a specific domain. However, after training the model for one epoch on a very small subset of PubMed abstracts, it loses the ability to generate coherent English and seems to output just random words. Could you provide insight into why the model degrades so quickly?

I am fine-tuning the base pre-trained model from https://huggingface.co/microsoft/biogpt:

from transformers import BioGptTokenizer, BioGptForCausalLM

abs_tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
abs_model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
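
As a quick sanity check after loading (a small sketch), the base checkpoint reports 24 transformer layers, which is what the layer-freezing loop below relies on:

print(abs_model.config.num_hidden_layers)  # 24 layers for microsoft/biogpt
print(abs_model.config.hidden_size)        # hidden size 1024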

To fine-tune, I freeze all layers except the last one and train the model in a self-supervised way, predicting the next token from the previous ones:


# Freeze the token and position embeddings
for param in abs_model.biogpt.embed_tokens.parameters():
    param.requires_grad = False

for param in abs_model.biogpt.embed_positions.parameters():
    param.requires_grad = False

# Freeze the first 23 transformer layers (i.e. all layers except the last one)
for i, layer in enumerate(abs_model.biogpt.layers):
    if i < 23:
        for param in layer.parameters():
            param.requires_grad = False

# Keep the final layer norm trainable
for param in abs_model.biogpt.layer_norm.parameters():
    param.requires_grad = True

# Check which parameters are trainable
for name, param in abs_model.named_parameters():
    print(name, param.requires_grad)
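
Rather than reading through the full list of flags, a rough count of the trainable parameters (a small sketch, assuming abs_model is set up as above) summarizes what the freezing leaves open:

trainable = sum(p.numel() for p in abs_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in abs_model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.1f}%)")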

I couldn't find a dataset definition in this repository for training the foundation model, so I came up with my own:

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BioGptTokenizer, BioGptForCausalLM, Trainer, TrainingArguments, EarlyStoppingCallback

# Prepare the dataset
class AbstractsDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length):
        self.tokenizer = tokenizer
        self.texts = texts
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts.iloc[idx]
        # Tokenize and pad the sequence to the max_length
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            return_tensors='pt',  # Return PyTorch tensors
            padding='max_length',  # Add padding
            truncation=True
        )
        # Labels are the input ids shifted by one position, so each token predicts the next one
        input_ids = encoding['input_ids'].squeeze(0)  # Remove the batch dimension added by `return_tensors`
        labels = input_ids.clone()
        labels[:-1] = input_ids[1:]
        labels[-1] = -100  # Set the label for the last position to -100 so it is ignored in the loss
        return {
            'input_ids': input_ids,
            'labels': labels
        }

# Create the dataset
train_dataset = AbstractsDataset(train_abstracts, abs_tokenizer, max_length=512)  # Adjust max_length as needed
val_dataset = AbstractsDataset(val_abstracts, abs_tokenizer, max_length=512)  # Adjust max_length as needed
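
To sanity-check the tokenization (a small sketch; train_abstracts is assumed to be a pandas Series of abstract strings, which is why __getitem__ uses .iloc), I inspect a single item:

sample = train_dataset[0]
print(sample['input_ids'].shape)                       # torch.Size([512])
print(abs_tokenizer.decode(sample['input_ids'][:30]))  # decode the first few tokens back to text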

Below is the code for training:

# Training arguments
training_args = TrainingArguments(
    output_dir=f'./results_{formatted_date}',          # output directory
    num_train_epochs=1,              # number of training epochs, adjust as needed
    per_device_train_batch_size=4,   # batch size per device during training, adjust based on your GPU(s)
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    eval_steps=500,                  # evaluation will be performed every 500 steps
    save_steps=500,                  # save a checkpoint every 500 steps
    weight_decay=0.01,               # strength of weight decay
    load_best_model_at_end=True,     # load the best model when finished training (based on `metric_for_best_model`)
    logging_dir='./logs',            # directory for storing logs
    metric_for_best_model="loss",    # use loss to evaluate the best model
    evaluation_strategy="steps",     # evaluate at regular intervals
    greater_is_better=False,         # lower loss indicates a better model
    logging_steps=1,
)

# Initialize Trainer
trainer = Trainer(
    model=abs_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)]  # Stop training after 5 evaluations without improvement
)

trainer.args._n_gpu = 1  # force single-GPU training by overriding a private TrainingArguments field

# Train the model
trainer.train()

# Save the fine-tuned model
abs_model.save_pretrained(f'./saved_model/updated_{formatted_date}')
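
Before generating, I reload the fine-tuned weights from that directory (a minimal sketch using the same path as the save_pretrained call above):

ft_model = BioGptForCausalLM.from_pretrained(f'./saved_model/updated_{formatted_date}')
ft_model.eval()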

After training for one epoch on 10,000 abstracts, the validation loss decreased from 11.34 to 3.40. However, the quality of the generated text became much worse. For example, for the prompt "COVID-19 is", the fine-tuned version gives the following:

COVID-19 is on world of, to the and this has the to the of world., its of. patients and to the world, have a on illness caused a., the of.,, and of people.

While the base pre-trained BioGPT provides a more reasonable answer:

COVID-19 is still an ongoing pandemic.
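
For reference, the completions above are produced roughly like this (a minimal sketch; the generation parameters are illustrative rather than the exact values from my run, and swapping ft_model for abs_model gives the base-model output):

inputs = abs_tokenizer("COVID-19 is", return_tensors="pt")
with torch.no_grad():
    output_ids = ft_model.generate(
        **inputs,
        max_new_tokens=50,
        num_beams=5,
        early_stopping=True,
    )
print(abs_tokenizer.decode(output_ids[0], skip_special_tokens=True))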

This leads me to wonder: how can the loss decrease while the text quality gets worse? Were there any specific training techniques or considerations in the original training of BioGPT that I might be missing?

Any insights or suggestions would be greatly appreciated. Thank you for your time and assistance.

For your reference, I am attaching a Jupyter notebook for the run described above. [Uploading 2.2-FineTune… - JupyterLab.pdf…]()