huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Bert2bert on Swag with very low accuracy #11730

Closed helloworld123-lab closed 3 years ago

helloworld123-lab commented 3 years ago

Hello everyone,

I am trying to build a multiple-choice QA system using Bert2Bert. I am following the T5 model for Swag given in https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb. My complete code is here: https://colab.research.google.com/drive/1MAGCi5TC1S6GNW3CFEB0f2cMkQ5gpxdN?usp=sharing

To integrate the bert2bert model, I followed this notebook: https://colab.research.google.com/drive/1Ekd5pUeCX7VOrMx94_czTkwNtLN32Uyu?usp=sharing

I created a Bert2BertFineTuner class based on the T5FineTuner class in https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb. The only changes I made to T5FineTuner for Bert2Bert are the calls to EncoderDecoderModel.from_encoder_decoder_pretrained(.) and BertTokenizer.from_pretrained(.):

import pytorch_lightning as pl
from transformers import BertTokenizer, EncoderDecoderModel

class Bert2BertFineTuner(pl.LightningModule):
  def __init__(self, hparams):
    super(Bert2BertFineTuner, self).__init__()
    self.hparams = hparams

    #self.model = T5ForConditionalGeneration.from_pretrained(hparams.model_name_or_path)
    #self.tokenizer = T5Tokenizer.from_pretrained(hparams.tokenizer_name_or_path)
    self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    self.model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
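    # note: bert-base-uncased defines no BOS/EOS token, so bos_token_id and eos_token_id are None here
    # unless they were assigned on the tokenizer first; cls_token_id / sep_token_id are the usual BERT stand-ins
    self.model.config.decoder_start_token_id = self.tokenizer.bos_token_id
    self.model.config.eos_token_id = self.tokenizer.eos_token_id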
    self.model.config.pad_token_id = self.tokenizer.pad_token_id

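    # sensible parameters for beam search
    # (note: with min_length = 56 and max_length = 142, generated sequences are much longer than the short label target)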
    self.model.config.vocab_size = self.model.config.decoder.vocab_size
    self.model.config.max_length = 142
    self.model.config.min_length = 56
    self.model.config.no_repeat_ngram_size = 3
    self.model.config.early_stopping = True
    self.model.config.length_penalty = 2.0
    self.model.config.num_beams = 4

  def is_logger(self):
    return self.trainer.proc_rank <= 0

  def forward(
      self, input_ids=None, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, lm_labels=None
  ):
    return self.model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        labels=lm_labels,
    )

  def _step(self, batch):
    lm_labels = batch["target_ids"]
    lm_labels[lm_labels[:, :] == self.tokenizer.pad_token_id] = -100

    outputs = self(
        input_ids=batch["source_ids"],
        attention_mask=batch["source_mask"],
        lm_labels=lm_labels,
        decoder_attention_mask=batch['target_mask'],
        decoder_input_ids=batch['target_ids']
    )

    loss = outputs[0]

    return loss
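One detail in `_step` may be worth checking: `lm_labels` is the same object as `batch["target_ids"]`, so writing -100 into it in place also changes the tensor that is later passed as `decoder_input_ids`. The sketch below reproduces the effect with plain Python lists standing in for tensors. It is an illustration only: `mask_padding` and `PAD_ID` are hypothetical names, the token ids are made up, and with real tensors the copy would be `batch["target_ids"].clone()`.

```python
import copy

PAD_ID = 0           # assumed pad id, for illustration only
IGNORE_INDEX = -100  # label value that the loss is told to ignore


def mask_padding(label_rows):
    """Replace pad ids with IGNORE_INDEX in place and return the same rows."""
    for row in label_rows:
        for i, token_id in enumerate(row):
            if token_id == PAD_ID:
                row[i] = IGNORE_INDEX
    return label_rows


# Aliased version: labels and decoder inputs are the same object.
batch = {"target_ids": [[101, 1016, 102, 0, 0]]}
labels = mask_padding(batch["target_ids"])
aliased_decoder_inputs = batch["target_ids"]  # now contains -100 as well

# Copied version: labels are masked, decoder inputs keep their pad ids.
batch = {"target_ids": [[101, 1016, 102, 0, 0]]}
labels_copy = mask_padding(copy.deepcopy(batch["target_ids"]))
clean_decoder_inputs = batch["target_ids"]

print(aliased_decoder_inputs)
print(clean_decoder_inputs)
```

A negative id cannot be looked up in an embedding table, so if any target in a batch is padded, the aliased version would feed invalid `decoder_input_ids` to the model.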

As shown above, I updated the model, config, and tokenizer for the bert2bert model. A sample encoded input and target pair looks like this:

data = dataset[6]
print(tokenizer.decode(data['source_ids']))
print("**")
print(tokenizer.decode(data['target_ids']))

[CLS] context : in what spanish speaking north american country can you get a great cup of coffee? options : 1 : mildred's coffee shop 2 : mexico 3 : diner 4 : kitchen 5 : canteen < / s > [SEP] [PAD] [PAD] [PAD] [PAD] [PAD]
**
[CLS] 2 < / s > [SEP]
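The `< / s >` pieces in both decoded strings suggest that the T5-style `</s>` suffix from the original notebook is still being appended to the text. BERT's vocabulary has no such token, so the tokenizer splits it into separate pieces while also adding its own `[CLS]` and `[SEP]`. A minimal sketch of building the strings without that suffix, assuming the question and options are available as plain strings (`format_source` and `format_target` are hypothetical helpers, not part of either notebook):

```python
def format_source(question, options):
    """Build the encoder input text: the question followed by numbered options."""
    numbered = " ".join(f"{i}: {option}" for i, option in enumerate(options, start=1))
    return f"context: {question} options: {numbered}"


def format_target(label_index):
    """Build the decoder target text: the 1-based option number, with no '</s>' suffix."""
    return str(label_index)


question = "in what spanish speaking north american country can you get a great cup of coffee?"
options = ["mildred's coffee shop", "mexico", "diner", "kitchen", "canteen"]

source_text = format_source(question, options)
target_text = format_target(2)

print(source_text)
print(target_text)
```

The tokenizer then adds `[CLS]` and `[SEP]` on its own when these strings are encoded.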

In the decoded sample, 2 is the label. I ran the model with the following parameters:

{'output_dir': 't5_swag', 'model_name_or_path': 'bert2bert', 'tokenizer_name_or_path': 'bert-base', 'max_seq_length': 512, 'learning_rate': 3e-05, 'weight_decay': 0.0, 'adam_epsilon': 1e-08, 'warmup_steps': 0, 'train_batch_size': 8, 'eval_batch_size': 8, 'num_train_epochs': 4, 'gradient_accumulation_steps': 16, 'n_gpu': 1, 'early_stop_callback': False, 'fp_16': False, 'opt_level': 'O1', 'max_grad_norm': 1.0, 'seed': 42, 'data_dir': ''}

Training finishes with the following loss values:

Validation sanity check: 100%
5/5 [00:03<00:00, 1.71it/s]
INFO:__main__:LOOKING AT  train
INFO:__main__:hello
Epoch 4: 100%
1370/1370 [35:51<00:00, 1.57s/it, loss=0.017, v_num=0, val_loss=0.268]
Validating: 100%
153/153 [01:31<00:00, 1.67it/s]
INFO:__main__:***** Validation results *****
INFO:__main__:avg_val_loss = tensor(0.2726, device='cuda:0')

INFO:__main__:loss = tensor(0.2695, device='cuda:0')

INFO:__main__:train_loss = tensor(0.2695, device='cuda:0')

INFO:__main__:val_loss = tensor(0.2726, device='cuda:0')

Validating: 100%
153/153 [01:31<00:00, 1.67it/s]
INFO:__main__:***** Validation results *****
INFO:__main__:avg_train_loss = tensor(1.1325, device='cuda:0')

INFO:__main__:avg_val_loss = tensor(0.2689, device='cuda:0')

INFO:__main__:epoch = 0

INFO:__main__:loss = tensor(0.2677, device='cuda:0')

INFO:__main__:train_loss = tensor(0.2677, device='cuda:0')

INFO:__main__:val_loss = tensor(0.2689, device='cuda:0')

Validating: 100%
153/153 [01:33<00:00, 1.64it/s]
INFO:__main__:***** Validation results *****
INFO:__main__:avg_train_loss = tensor(0.2719, device='cuda:0')

INFO:__main__:avg_val_loss = tensor(0.2686, device='cuda:0')

INFO:__main__:epoch = 1

INFO:__main__:loss = tensor(0.2674, device='cuda:0')

INFO:__main__:train_loss = tensor(0.2674, device='cuda:0')

INFO:__main__:val_loss = tensor(0.2686, device='cuda:0')

Validating: 100%
153/153 [01:33<00:00, 1.64it/s]
INFO:__main__:***** Validation results *****
INFO:__main__:avg_train_loss = tensor(0.2702, device='cuda:0')

INFO:__main__:avg_val_loss = tensor(0.2684, device='cuda:0')

INFO:__main__:epoch = 2

INFO:__main__:loss = tensor(0.2623, device='cuda:0')

INFO:__main__:train_loss = tensor(0.2623, device='cuda:0')

INFO:__main__:val_loss = tensor(0.2684, device='cuda:0')
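For context on how much training these logs represent: with `train_batch_size` 8 and `gradient_accumulation_steps` 16, each optimizer update sees 128 examples. The sketch below works out the approximate number of updates. It assumes (the log does not state this) that the progress-bar total of 1370 counts the 153 validation batches as well as the training batches.

```python
train_batch_size = 8
gradient_accumulation_steps = 16
num_train_epochs = 4

# Counts read from the progress bars in the log above.
progress_bar_total = 1370
validation_batches = 153

# Assumption: the epoch progress bar includes the validation batches.
train_batches = progress_bar_total - validation_batches

effective_batch_size = train_batch_size * gradient_accumulation_steps
updates_per_epoch = train_batches // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs

print(effective_batch_size, train_batches, updates_per_epoch, total_updates)
```

Under that assumption the run makes roughly 300 optimizer updates in total, which may be too few for the newly initialised cross-attention weights of the encoder-decoder to learn the task.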

The validation part:

model.model.eval()
outputs = []
targets = []
for batch in tqdm(loader):
  outs = model.model.generate(input_ids=batch['source_ids'].cuda(), 
                              attention_mask=batch['source_mask'].cuda())

  dec = [tokenizer.decode(ids) for ids in outs]
  target = [tokenizer.decode(ids) for ids in batch["target_ids"]]

  outputs.extend(dec)
  targets.extend(target)

metrics.accuracy_score(targets1, outputs1)
0.20065520065520065
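`accuracy_score` counts a prediction as correct only when the two strings match exactly, and `tokenizer.decode(ids)` keeps special tokens such as `[CLS]`, `[SEP]` and `[PAD]` by default, so any difference in padding or trailing tokens counts as an error. How `targets1` and `outputs1` are derived is not shown here. The sketch below is one possible normalisation, using hypothetical helpers `extract_label` and `label_accuracy` that compare only the first number found in each decoded string.

```python
import re


def extract_label(decoded):
    """Return the first run of digits in a decoded string, or '' if there is none."""
    match = re.search(r"\d+", decoded)
    return match.group(0) if match else ""


def label_accuracy(targets, outputs):
    """Fraction of pairs whose extracted labels are non-empty and equal."""
    if len(targets) != len(outputs):
        raise ValueError("targets and outputs must have the same length")
    if not targets:
        return 0.0
    correct = 0
    for target, output in zip(targets, outputs):
        target_label = extract_label(target)
        if target_label and target_label == extract_label(output):
            correct += 1
    return correct / len(targets)


example_targets = ["[CLS] 2 < / s > [SEP]", "[CLS] 4 < / s > [SEP]"]
example_outputs = ["[CLS] 2 [SEP] [PAD] [PAD]", "[CLS] 3 3 3 [SEP]"]
example_accuracy = label_accuracy(example_targets, example_outputs)
print(example_accuracy)
```

Passing `skip_special_tokens=True` to `tokenizer.decode` is another way to drop the special tokens before comparing.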


The accuracy is very low. What could be the reason? I am probably missing something, but I could not find it.
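As a point of reference, the sample shown earlier lists five options, so if every example has five (an assumption), random guessing would score about 0.20. The sketch below recovers the underlying count from the reported accuracy and compares it with that chance level. The denominator bound of 1224 comes from the 153 evaluation batches of size 8 in the log.

```python
from fractions import Fraction
from math import sqrt

reported_accuracy = 0.20065520065520065
num_options = 5         # assumption: every example has five options
max_examples = 153 * 8  # 153 evaluation batches of size 8 (upper bound)

# Closest fraction to the reported value with a denominator no larger than max_examples.
ratio = Fraction(reported_accuracy).limit_denominator(max_examples)
correct, total = ratio.numerator, ratio.denominator

chance = 1 / num_options
# Standard error of the accuracy if every answer were a uniform random guess.
standard_error = sqrt(chance * (1 - chance) / total)
z_score = (reported_accuracy - chance) / standard_error

print(correct, total, round(z_score, 3))
```

Under these assumptions the result corresponds to 245 correct out of 1221, which is statistically indistinguishable from guessing. That would suggest the compared predictions carry no information about the label, rather than that the model is only partially trained.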

patrickvonplaten commented 3 years ago

Hey @helloworld123-lab,

Thanks for the issue :-) Is there a specific reason to use Bert2bert for SWAG instead of just a BERT model?

helloworld123-lab commented 3 years ago

Sorry for the issue :) Actually, I am new to this field and have just started working with models using transformers. T5 is a text-to-text model, and I just wanted to see how bert2bert performs with the same setup. Is this the wrong approach to Swag?

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.