Closed dpitawela closed 2 years ago
We haven't really tested TransformerXL with EncoderDecoderModel, so I'm not sure whether it will work, since it's a bit of a different model. One major difference is that TransformerXL does not accept attention_mask, but EncoderDecoderModel passes it each time. You could try removing attention_mask and see if it works.
Also, TransformerXL is a decoder-only model, so it might not give the best results as an encoder. And out of curiosity, is there any reason you want to try a TransformerXL-to-TransformerXL model?
Thanks @patil-suraj very much for your response. The reason I am using a TransformerXL-to-TransformerXL model is to enable the model to process long sequences, as I am trying to address a document summarization problem for research, and its recurrent nature would be extremely beneficial for my task. If possible, please explain this to me.
Hey @dpitawela sorry to only answer now.
Regarding the attention_mask, @patrickvonplaten do you know why it's not accepted? TransformerXL might not be a good choice for the encoder, since the model is trained as a decoder. Also, it is trained on WikiText-103, which is not a good enough dataset for pre-training. There are two other models that can process long sequences: Longformer and BigBird. You could use Longformer as the encoder and bert/gpt2 as the decoder, or you could use the LED model. And BigBird can be used as both encoder and decoder, so you could use bigbird2bigbird if the target sequences are also long, or bigbird to bert/gpt2.
Hope this helps :)
Regarding the attention_mask, I'm actually also not sure why this is not used in TransfoXL. I think the reason could be that the model is always used in a causal-mask (LM objective) setting. This would mean that an attention_mask is unnecessary when training the model, since inputs are padded to the right and those positions are masked anyway.
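A dependency-free sketch of that reasoning: under a causal mask, position i attends only to positions j <= i, so tokens padded on the right are never visible to the real tokens anyway.

```python
# Causal (lower-triangular) mask for a length-5 sequence: True = may attend.
seq_len = 5
causal_mask = [[j <= i for j in range(seq_len)] for i in range(seq_len)]

# No position attends to anything on its right, so right-padding tokens are
# ignored by all earlier (real) positions without any explicit attention_mask.
assert not causal_mask[0][4]   # first token cannot see the last (padding) slot
assert causal_mask[4][0]       # last position can still see the first
```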
Gently pinging @TevenLeScao here - maybe he has a better answer
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Environment info
transformers version: 4.10.0
@patrickvonplaten @patil-suraj
Information
I am using EncoderDecoderModel (encoder=TransfoXLModel, decoder=TransfoXLLMHeadModel) to train a generative model for text summarization on the 'multi_x_science_sum' Hugging Face dataset.
When training starts, the following error is raised and training stops: TypeError: forward() got an unexpected keyword argument 'attention_mask'
To reproduce
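A minimal sketch of the failing setup described above (assuming the public transfo-xl-wt103 checkpoint; not a verified reproduction script):

```python
from transformers import EncoderDecoderModel

# TransfoXL as both encoder and decoder. EncoderDecoderModel.forward passes
# attention_mask through to the encoder, but TransfoXL's forward does not
# accept that argument, which produces the TypeError reported above.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "transfo-xl-wt103", "transfo-xl-wt103"
)
```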
As a side note, when I do the same task with the following setting, training starts without a problem:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased')
Please provide me assistance on how to do the training with a TransformerXL-to-TransformerXL model.