mengyahuUSTC-PU opened this issue 4 years ago
Hi @mengyahuUSTC-PU, the checkpoints saved by Lightning can't be loaded directly into HF models. To do that:
model = T5FineTuner.load_from_checkpoint("lightning checkpoint path")
# save the model in HF format with
model.model.save_pretrained("hf_model")
# after this you can load the hf_model using
model = T5ForConditionalGeneration.from_pretrained("hf_model")
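If you then want to sanity-check the converted model, something like this should work (just a sketch; the tokenizer name and the "paraphrase:" prefix are assumptions, use whatever matches your fine-tuning setup):
from transformers import T5Tokenizer, T5ForConditionalGeneration
# assumption: fine-tuning used the stock t5-base tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("hf_model")
# assumption: "paraphrase:" was the prefix used during fine-tuning
inputs = tokenizer("paraphrase: The weather is nice today.", return_tensors="pt")
outputs = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))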
As far as I have seen in my experiments, T5 is pretty insensitive to hyperparameters. Most of the time my lr is either 3e-4 or 3e-5; with t5-base my batch size is 8 (for a V100), gradient accumulation steps 16, weight_decay 0.0 and adam_epsilon 1e-8. I've used these hparams in all my experiments and they have given really good results. But feel free to experiment.
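For reference, here is roughly how those values could be packed into an args namespace (a sketch only; the field names mirror the T5FineTuner code further down this thread, so rename them to whatever your script expects, and add data fields like data_dir and the max lengths):
import argparse
args_dict = dict(
    model_name_or_path="t5-base",
    tokenizer_name_or_path="t5-base",
    learning_rate=3e-4,                 # or 3e-5
    weight_decay=0.0,
    adam_epsilon=1e-8,
    warmup_steps=0,
    train_batch_size=8,                 # fits a single V100 for t5-base
    eval_batch_size=8,
    gradient_accumulation_steps=16,
    num_train_epochs=2,
    n_gpu=1,
    fp_16=False,
    opt_level="O1",
    max_grad_norm=1.0,
)
args = argparse.Namespace(**args_dict)  # so fields are reachable as args.learning_rate etc.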
Paraphrase identification is one of the GLUE tasks, so it's used in the pre-training mixture. You can find how the input is processed in the appendix of the paper (page 46, MRPC task).
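To make that concrete, the MRPC examples in the paper are serialized roughly like this (a sketch; double-check the exact template in Appendix D before training):
# Rough sketch of the MRPC text-to-text format from the T5 paper appendix.
def format_mrpc(sentence1, sentence2):
    # the input gets the task prefix plus named sentence fields
    return f"mrpc sentence1: {sentence1} sentence2: {sentence2}"

# the target is a text label, e.g. "equivalent" or "not_equivalent"
print(format_mrpc("He said hi.", "He greeted them."))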
For generation you should use T5ForConditionalGeneration. It's almost the same as T5Model but adds an LM head on top for generation; T5Model is just the stack of encoder and decoder. In most cases you'll want T5ForConditionalGeneration.
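To make the difference concrete, here's a small sketch (model name and input text are just placeholders):
from transformers import T5Model, T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
enc = tokenizer("summarize: studies have shown that owning a dog is good for you", return_tensors="pt")

# T5Model: bare encoder-decoder stack, returns hidden states; no LM head, so it can't produce text.
bare = T5Model.from_pretrained("t5-small")
hidden_states = bare(input_ids=enc["input_ids"], decoder_input_ids=enc["input_ids"])[0]

# T5ForConditionalGeneration: same stack plus an LM head, so it gives vocab logits and supports generate().
lm = T5ForConditionalGeneration.from_pretrained("t5-small")
summary_ids = lm.generate(enc["input_ids"], max_length=20)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))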
Thanks for your quick reply! These are very helpful! @patil-suraj
2.1 In the evaluation part, you use model.model.generate(..). I do not understand why not model.generate(..).
2.2 For the sample examples you didn't use model.eval(), but in the later part you did. From Huggingface's docs, it seems unnecessary when the model is loaded with from_pretrained: "The model is set in evaluation mode by default using model.eval() (Dropout modules are deactivated). To train the model, you should first set it back in training mode with model.train()". On the other hand, none of the examples in the code use model.train() to set the mode; they train the model directly. I am confused.
I am also a little confused about the prefix. Huggingface's docs say: 'T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g. for translation: translate English to German: …, summarize: …. For more information about which prefix to use, it is easiest to look into Appendix D of the paper.' Thus, I think the prefix tells T5 which task it should be performing. However, in the first two examples here you seem to only add "</s>" at the end of the sentence but no prefix. Could you tell me why? Does that mean your fine-tuned model will not do the pre-training tasks of T5 but only your trained task, so you don't need a prefix?
Also, MRPC and QQP are both paraphrase identification tasks. If I want to fine-tune, should I fine-tune my data set with both of their prefixes, with just one of them, or create my own prefix?
I set early_stop_callback=True and max_epochs=32, and it stops at epoch 11. But if I set max_epochs=6, it stops at epoch 3. I don't understand this, as I thought it would stop at epoch 6. I have the same random seed.
Another strange thing during training: on the screen I saw "Epoch 10: 100% ... (time, loss, etc.) ... INFO:main:avg_train_loss = tensor(..) INFO:main:epoch = 8 ...". Why is the epoch number not the same?!
For people who are following this thread: @patil-suraj answered the questions in the other thread mentioned here, which contains other general questions about T5.
Hi @patil-suraj, thank you very much for this helpful repository 🙂
I finished fine-tuning and tried to load the model from a pytorch-lightning checkpoint, but it doesn't work.
What can I do to fix this?
model = T5FineTuner.load_from_checkpoint("dir of my .ckpt file")
returns this error:
AttributeError: 'dict' object has no attribute 'model_name_or_path'.
My train_params are as follows.
logger = TensorBoardLogger(LOG_DIR, name=NAME)
early_stopping_callback = EarlyStopping(monitor='val_loss', patience=5)
checkpoint_callback = ModelCheckpoint(
    dirpath=CHECKPOINT_DIR,
    filename=FILENAME,
    save_top_k=1,
    verbose=True,
    monitor="val_loss",
    mode="min"
)
train_params = dict(
    logger=logger,
    callbacks=[early_stopping_callback, checkpoint_callback],
    accumulate_grad_batches=args.gradient_accumulation_steps,
    gpus=args.n_gpu,
    max_epochs=args.num_train_epochs,
    precision=16 if args.fp_16 else 32,
    amp_level=args.opt_level,
    gradient_clip_val=args.max_grad_norm,
)
And this is my T5FineTuner:
class T5FineTuner(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.save_hyperparameters(hparams)
        self.model = T5ForConditionalGeneration.from_pretrained(hparams.model_name_or_path)
        self.tokenizer = T5Tokenizer.from_pretrained(hparams.tokenizer_name_or_path, is_fast=True)

    def forward(self, input_ids, attention_mask=None, decoder_input_ids=None,
                decoder_attention_mask=None, labels=None):
        return self.model(
            input_ids,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            decoder_attention_mask=decoder_attention_mask,
            labels=labels
        )

    def _step(self, batch):
        labels = batch["target_ids"]
        # All labels set to -100 are ignored (masked);
        # the loss is only computed for labels in [0, ..., config.vocab_size]
        labels[labels[:, :] == self.tokenizer.pad_token_id] = -100
        outputs = self(
            input_ids=batch["source_ids"],
            attention_mask=batch["source_mask"],
            decoder_attention_mask=batch['target_mask'],
            labels=labels
        )
        loss = outputs[0]
        logits = outputs[1]
        return loss, logits

    def training_step(self, batch, batch_idx):
        target = batch["target_ids"]
        loss, logits = self._step(batch)
        self.log("train_loss", loss, prog_bar=True, logger=True)
        return {"loss": loss, "predictions": logits, "targets": target}

    def validation_step(self, batch, batch_idx):
        target = batch["target_ids"]
        loss, logits = self._step(batch)
        self.log("val_loss", loss, prog_bar=True, logger=True)
        return {"loss": loss, "predictions": logits, "targets": target}

    def test_step(self, batch, batch_idx):
        loss, logits = self._step(batch)
        self.log("test_loss", loss, prog_bar=True, logger=True)
        return loss

    def configure_optimizers(self):
        model = self.model
        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": [p for n, p in model.named_parameters()
                           if not any(nd in n for nd in no_decay)],
                "weight_decay": self.hparams.weight_decay,
            },
            {
                "params": [p for n, p in model.named_parameters()
                           if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
            },
        ]
        optimizer = AdamW(optimizer_grouped_parameters,
                          lr=self.hparams.learning_rate,
                          eps=self.hparams.adam_epsilon)
        self.optimizer = optimizer
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=self.hparams.warmup_steps,
            num_training_steps=self.t_total
        )
        self.scheduler = scheduler
        return [optimizer], [{"scheduler": scheduler, "interval": "step", "frequency": 1}]

    def get_dataset(self, tokenizer, type_path, args):
        return TsvDataset(
            tokenizer=tokenizer,
            data_dir=args.data_dir,
            type_path=type_path,
            input_max_len=args.max_input_length,
            target_max_len=args.max_target_length)

    def setup(self, stage=None):
        if stage == 'fit' or stage is None:
            train_dataset = self.get_dataset(tokenizer=self.tokenizer,
                                             type_path="train.tsv", args=self.hparams)
            self.train_dataset = train_dataset
            val_dataset = self.get_dataset(tokenizer=self.tokenizer,
                                           type_path="dev.tsv", args=self.hparams)
            self.val_dataset = val_dataset
            self.t_total = (
                (len(train_dataset) // (self.hparams.train_batch_size * max(1, self.hparams.n_gpu)))
                // self.hparams.gradient_accumulation_steps
                * float(self.hparams.num_train_epochs)
            )

    def train_dataloader(self):
        return DataLoader(self.train_dataset,
                          batch_size=self.hparams.train_batch_size,
                          drop_last=True, shuffle=True, num_workers=4)

    def val_dataloader(self):
        return DataLoader(self.val_dataset,
                          batch_size=self.hparams.eval_batch_size,
                          num_workers=4)
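(For reference, this is roughly how the pieces above are wired together; args is assumed to hold the fields referenced in train_params and T5FineTuner.)
import pytorch_lightning as pl

model = T5FineTuner(args)
trainer = pl.Trainer(**train_params)
trainer.fit(model)
# the best checkpoint (lowest val_loss) ends up at checkpoint_callback.best_model_path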
I fine-tuned t5-large for paraphrase generation for 2 epochs and the paraphrases generated look good. When I trained for 11 epochs, the model seems overfitted (the paraphrases generated are similar to the original sentence).
1. I want to check the performance of the saved checkpoints, but I don't know how to do it. I tried
PATH = './t5_paraphrase/checkpointepoch=10.ckpt'
model = T5ForConditionalGeneration.from_pretrained(PATH)
which gives the error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I also tried (https://pytorch-lightning.readthedocs.io/en/latest/weights_loading.html)
model = T5ForConditionalGeneration.load_from_checkpoint(PATH)
which gives: AttributeError: type object 'T5ForConditionalGeneration' has no attribute 'load_from_checkpoint'.
2. Do you have any recommendations for fine-tuning T5? I know this question is too broad. I have explored all the data sets I have. For hyperparameters, I read the pytorch-lightning docs and found auto_lr_find, auto_scale_batch_size and fast_dev_run may be fun to try. However, because of the definition of t_total in train_dataloader, these raise errors. So maybe there are no more tricks on this side.
3. For paraphrase generation using T5 as a text-to-text task, I don't know how to utilize the negative examples directly here. Any recommendations? I plan to further fine-tune T5-large on paraphrase identification with my data set (which has positive and negative examples) and then use this fine-tuned version to further fine-tune on paraphrase generation. I am still investigating how to do this, so any help would be appreciated.
4. In your example, you use T5ForConditionalGeneration (https://huggingface.co/transformers/model_doc/t5.html#t5forconditionalgeneration). It is not very clear to me when I need to use T5Model rather than T5ForConditionalGeneration. Any resources on this?
Thanks!!