huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

I met an error when using EncoderDecoderModel. #12992

Closed Captainr22 closed 3 years ago

Captainr22 commented 3 years ago

Environment info

Who can help @patrickvonplaten, @patil-suraj

Information

Model I am using: EncoderDecoderModel

When I use EncoderDecoderModel, my code is:

from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-large-uncased', 'gpt2')
model = model.cuda()
output = model(input_ids, input_mask, decoder_input_ids, decoder_input_mask, labels=labels)

I get this error:

Traceback (most recent call last):
  File "/home/jwli/ljw/study/test.py", line 68, in <module>
    output = model(input_ids, input_mask, decoder_input_ids, decoder_input_mask, labels=labels)
  File "/home/jwli/anaconda3/envs/study/lib/python3.9/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jwli/anaconda3/envs/study/lib/python3.9/site-packages/transformers/models/encoder_decoder/modeling_encoder_decoder.py", line 438, in forward
    decoder_outputs = self.decoder(
  File "/home/jwli/anaconda3/envs/study/lib/python3.9/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jwli/anaconda3/envs/study/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 941, in forward
    transformer_outputs = self.transformer(
  File "/home/jwli/anaconda3/envs/study/lib/python3.9/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jwli/anaconda3/envs/study/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 789, in forward
    outputs = block(
  File "/home/jwli/anaconda3/envs/study/lib/python3.9/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jwli/anaconda3/envs/study/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 339, in forward
    cross_attn_outputs = self.crossattention(
  File "/home/jwli/anaconda3/envs/study/lib/python3.9/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jwli/anaconda3/envs/study/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 239, in forward
    key, value = self.c_attn(encoder_hidden_states).split(self.split_size, dim=2)
  File "/home/jwli/anaconda3/envs/study/lib/python3.9/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jwli/anaconda3/envs/study/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1400, in forward
    x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
RuntimeError: mat1 dim 1 must match mat2 dim 0

But when I change 'bert-large-uncased' to 'bert-base-uncased', the code runs normally. Can you help me? @patrickvonplaten @patil-suraj @LysandreJik

NielsRogge commented 3 years ago

As you can see from the error, it has to do with the cross-attention: the encoder_hidden_states (which come from bert-large-uncased) have a dimensionality of 1024 (which I know from looking at the hidden_size attribute in the config file of bert-large-uncased). You can also check this by doing:

from transformers import BertConfig

config = BertConfig.from_pretrained('bert-large-uncased')
print(config.hidden_size)

or

from transformers import BertModel

model = BertModel.from_pretrained('bert-large-uncased')
print(model.config.hidden_size)

For the decoder, the queries have a dimensionality of 768 (again, you can check this in the config file or from Python). There's a bit of inconsistency between the models here, because for gpt2 the dimensionality is determined by the n_embd attribute (whereas it would ideally also be called hidden_size).
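For example, a quick check mirroring the BERT snippet above (this assumes the stock gpt2 checkpoint that the original code loads):

from transformers import GPT2Config

config = GPT2Config.from_pretrained('gpt2')
print(config.n_embd)  # 768 for the base gpt2 checkpoint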

Digging into the code, it turns out the error happens because the cross-attention key/value projection is defined as a Conv1D layer in modeling_gpt2.py, created as Conv1D(2 * self.embed_dim, self.embed_dim), so basically (2*768 = 1536, 768). However, this layer is then applied to the encoder_hidden_states, which have a dimensionality of 1024, so this will not work. You would have to update that line to:

self.c_attn = Conv1D(2*1024, 1024)
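As a sanity check (assuming the model variable from the snippet in the first post), you can inspect the shape of that cross-attention projection directly; the attribute path is the same one that shows up in the size-mismatch messages later in this thread:

# transformers' Conv1D stores its weight as (input_dim, output_dim), so with a gpt2
# decoder this prints torch.Size([768, 1536]); the 768 input side is what clashes
# with the 1024-dim encoder_hidden_states coming from bert-large-uncased.
print(model.decoder.transformer.h[0].crossattention.c_attn.weight.shape)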

Captainr22 commented 3 years ago

Thank you for your help, @NielsRogge! Actually, I knew the error was caused by the dimensions. The EncoderDecoderModel docs say: "The EncoderDecoderModel can be used to initialize a sequence-to-sequence model with any pretrained autoencoding model as the encoder and any pretrained autoregressive model as the decoder.", so I assumed it would handle the dimension matching automatically. I will follow your suggestions and modify my code, and close the issue. Thank you!

MichaelJanz commented 3 years ago

Hi, thanks for your answer @NielsRogge. I am trying to do the same for a GPT-2 model with n_embd = 1280, also using bert-large-uncased as the encoder with hidden_size = 1024.

I saved my model and now load it with: model = AutoModelForSeq2SeqLM.from_pretrained(...)

When I started to fine-tune my model, I hit the same error the OP reported.

I then followed your advice, but this resulted in:

size mismatch for decoder.transformer.h.0.crossattention.c_attn.weight: copying a param with shape torch.Size([1280, 2560]) from checkpoint, the shape in current model is torch.Size([1024, 2048]).

(There are many more of these lines with h.x increasing; I removed them for readability.) Am I missing something here? It looks like the model does not accept the new dimension. Could you give me advice on how to solve that and what I might be missing?

Thanks a lot!

MichaelJanz commented 3 years ago

Well, as I was writing my comment, the solution came to mind: after changing the code, you need to recreate the model. As the docs say, "Cross-attention layers are automatically added to the decoder and should be fine-tuned on a downstream generative task, like summarization", so the cross-attention layers are added automatically at model creation. Don't try to fit randomly initialized weights into a wrong shape ;)
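A minimal sketch of what "recreate the model" means here (the checkpoint names and save path are illustrative; gpt2-large is assumed as the 1280-dim decoder):

from transformers import EncoderDecoderModel

# Re-create the encoder-decoder pair from the original pretrained checkpoints so the
# (patched) cross-attention layers are instantiated fresh with the new shape, instead of
# loading a saved checkpoint whose cross-attention weights were stored with the old shape.
model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-large-uncased', 'gpt2-large')
model.save_pretrained('./bert-large2gpt2-large')  # fine-tune, then reload from this directory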

Sorry for taking your time.