huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

TypeError: forward() got an unexpected keyword argument 'attention_mask' #13812

Closed dpitawela closed 2 years ago

dpitawela commented 2 years ago

Environment info

@patrickvonplaten @patil-suraj

Information

I am using EncoderDecoderModel (encoder=TransfoXLModel, decoder=TransfoXLLMHeadModel) to train a generative model for text summarization on the 'multi_x_science_sum' Hugging Face dataset.

When training starts, the following error is raised and training stops: TypeError: forward() got an unexpected keyword argument 'attention_mask'

To reproduce


from transformers import (
    TransfoXLTokenizer,
    EncoderDecoderModel,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
txl2txl = EncoderDecoderModel.from_encoder_decoder_pretrained('transfo-xl-wt103', 'transfo-xl-wt103')

batch_size = 4

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    output_dir="output",
    logging_steps=2,
    save_steps=10,
    eval_steps=4,
    num_train_epochs=1
)

trainer = Seq2SeqTrainer(
    model=txl2txl,
    tokenizer=tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,          # metric function defined during preprocessing (not shown)
    train_dataset=train_data_processed,       # tokenized multi_x_science_sum train split
    eval_dataset=validation_data_processed    # tokenized multi_x_science_sum validation split
)
trainer.train()
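
For reference, train_data_processed and validation_data_processed are produced roughly along the lines below; this is only a simplified sketch of my preprocessing (the field choices and max lengths are illustrative, and compute_metrics is omitted).

from datasets import load_dataset

# Simplified sketch: use the paper abstract as the source text and the
# related-work section as the target summary (illustrative field choices).
def preprocess(batch):
    model_inputs = tokenizer(batch["abstract"], truncation=True, max_length=1024)
    targets = tokenizer(batch["related_work"], truncation=True, max_length=256)
    model_inputs["labels"] = targets["input_ids"]
    return model_inputs

raw = load_dataset("multi_x_science_sum")
train_data_processed = raw["train"].map(
    preprocess, batched=True, remove_columns=raw["train"].column_names
)
validation_data_processed = raw["validation"].map(
    preprocess, batched=True, remove_columns=raw["validation"].column_names
)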

TypeError: forward() got an unexpected keyword argument 'attention_mask'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\Users\DILEEP~1\AppData\Local\Temp/ipykernel_21416/3777690609.py in <module>
      7     eval_dataset=validation_data_processed
      8 )
----> 9 trainer.train()

~\.conda\envs\msresearch\lib\site-packages\transformers\trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1282                         tr_loss += self.training_step(model, inputs)
   1283                 else:
-> 1284                     tr_loss += self.training_step(model, inputs)
   1285                 self.current_flos += float(self.floating_point_ops(inputs))
   1286 

~\.conda\envs\msresearch\lib\site-packages\transformers\trainer.py in training_step(self, model, inputs)
   1787                 loss = self.compute_loss(model, inputs)
   1788         else:
-> 1789             loss = self.compute_loss(model, inputs)
   1790 
   1791         if self.args.n_gpu > 1:

~\.conda\envs\msresearch\lib\site-packages\transformers\trainer.py in compute_loss(self, model, inputs, return_outputs)
   1819         else:
   1820             labels = None
-> 1821         outputs = model(**inputs)
   1822         # Save past state if it exists
   1823         # TODO: this needs to be fixed and made cleaner later.

~\.conda\envs\msresearch\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

~\.conda\envs\msresearch\lib\site-packages\transformers\models\encoder_decoder\modeling_encoder_decoder.py in forward(self, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, encoder_outputs, past_key_values, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, **kwargs)
    423 
    424         if encoder_outputs is None:
--> 425             encoder_outputs = self.encoder(
    426                 input_ids=input_ids,
    427                 attention_mask=attention_mask,

~\.conda\envs\msresearch\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

TypeError: forward() got an unexpected keyword argument 'attention_mask'

As a side note, when I do the same task with the following setup, training starts without a problem:

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased')

Please advise me on how to train a TransformerXL-to-TransformerXL model.

patil-suraj commented 2 years ago

We haven't really tested TransformerXL with EncoderDecoderModel, so I'm not sure whether it will work, since it's a bit of a different model. One major difference is that TransformerXL does not accept attention_mask, while EncoderDecoderModel passes it on every forward call. You could try removing attention_mask and see if it works.
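
You can see the mismatch quickly with a check like this (purely illustrative):

import inspect
from transformers import BertModel, TransfoXLModel

# BERT's forward() takes attention_mask, TransfoXL's does not, which is why
# bert2bert works while TransfoXL fails inside EncoderDecoderModel.
print("attention_mask" in inspect.signature(BertModel.forward).parameters)      # True
print("attention_mask" in inspect.signature(TransfoXLModel.forward).parameters) # False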

Also, TransformerXL is a decoder-only model, so it might not give the best results as an encoder. And out of curiosity, is there any reason you want to try a TransformerXL-to-TransformerXL model?

dpitawela commented 2 years ago

Thanks @patil-suraj very much for your response. The reason I am using a TransformerXL-to-TransformerXL model is to let the model process long sequences, as I am trying to address a document summarization problem for my research, and its recurrent nature would be extremely beneficial for my task.

If possible, please explain:

  1. Why does TransformerXL not accept an attention mask?
  2. How can I try removing the attention_mask (if that is possible)?
  3. Since you say TransformerXL would not perform well as an encoder, can you suggest a model to use as the encoder while keeping TransformerXL as the decoder?
  4. Are you planning to test TransformerXL-to-TransformerXL with EncoderDecoderModel in the future?

patil-suraj commented 2 years ago

Hey @dpitawela, sorry for only answering now.

  1. I'm not very familiar with TransformerXL, so I'm not sure about the attention_mask. @patrickvonplaten, do you know why?
  2. Instead of removing the attention mask, I would suggest using a different model that can process long sequences; I will explain below.
  3. Yes, IMO TransformerXL might not be a good choice for the encoder, since the model is trained as a decoder. Also, it is trained on WikiText-103, which is not a good enough dataset for pre-training. There are two other models that can process long sequences: Longformer and BigBird.

You could use Longformer as the encoder and BERT/GPT-2 as the decoder, or you could use the LED model.
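
For example, something along these lines would be a starting point (the checkpoints here are just examples, untested for this task):

from transformers import EncoderDecoderModel, LEDForConditionalGeneration

# Longformer encoder + BERT decoder via EncoderDecoderModel
long2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    'allenai/longformer-base-4096', 'bert-base-uncased'
)

# Or LED, which is already a long-input encoder-decoder model
led = LEDForConditionalGeneration.from_pretrained('allenai/led-base-16384')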

And BigBird can be used as both encoder and decoder, so you could use a BigBird-to-BigBird model if the target sequences are also long, or BigBird to BERT/GPT-2.
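
Again just a sketch with example checkpoints:

from transformers import EncoderDecoderModel

# BigBird as both encoder and decoder
bigbird2bigbird = EncoderDecoderModel.from_encoder_decoder_pretrained(
    'google/bigbird-roberta-base', 'google/bigbird-roberta-base'
)

# Or BigBird encoder with a BERT decoder
bigbird2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    'google/bigbird-roberta-base', 'bert-base-uncased'
)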

  4. IMO TransformerXL is not a good choice for such a task, so probably not.

Hope this helps :)

patrickvonplaten commented 2 years ago

Regarding the attention_mask, I'm actually also not sure why it is not used in TransfoXL. I think the reason could be that the model is always used in a causal-mask (LM objective) setting. This would mean an attention_mask is unnecessary when training the model, since inputs are padded on the right and those positions are masked anyway.
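
For illustration (nothing specific to this issue), TransfoXL is normally called without any mask and carries context across segments through its mems cache instead:

from transformers import TransfoXLModel, TransfoXLTokenizer

tok = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')

ids = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt")["input_ids"]
out = model(input_ids=ids)                    # no attention_mask argument at all
out = model(input_ids=ids, mems=out.mems)     # previous segment is carried via mems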

Gently pinging @TevenLeScao here - maybe he has a better answer

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.