As you can see from the error, it has to do with the cross-attention: the `encoder_hidden_states` (which come from bert-large-uncased) have a dimensionality of 1024, which I know from looking at the `hidden_size` attribute of the config file of bert-large-uncased. You can also check this by doing:
```python
from transformers import BertConfig

config = BertConfig.from_pretrained('bert-large-uncased')
print(config.hidden_size)  # 1024
```
or
```python
from transformers import BertModel

model = BertModel.from_pretrained('bert-large-uncased')
print(model.config.hidden_size)  # 1024
```
For the decoder, the queries have a dimensionality of 768 (again, you can see this by looking at the config file or using Python). There's a bit of inconsistency between the models, because for gpt2 the dimensionality is determined by the `n_embd` attribute (whereas it should ideally also be called `hidden_size`).
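For example, a quick check of the decoder's dimensionality (assuming the stock gpt2 checkpoint):

```python
from transformers import GPT2Config

# GPT-2 stores its hidden size under n_embd rather than hidden_size
config = GPT2Config.from_pretrained('gpt2')
print(config.n_embd)  # 768
```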
Digging into the code, it turns out the error happens because the cross-attention layer is defined as a `Conv1D` layer, as can be seen here. Note that `Conv1D(nf, nx)` maps `nx` input features to `nf` output features: the layer is created with `nf = 2 * self.embed_dim` and `nx = self.embed_dim`, so basically `Conv1D(2*768, 768)`. However, one then applies this layer to the `encoder_hidden_states`, which have a dimensionality of 1024, so this will not work. You would have to update that line to:
```python
self.c_attn = Conv1D(2 * 1024, 1024)
```
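To see why the shapes clash, here is a minimal sketch (my own illustration; `Conv1D` lives in `transformers.modeling_utils` in v4.6.0, in `transformers.pytorch_utils` in newer releases):

```python
import torch
from transformers.modeling_utils import Conv1D  # location in transformers v4.6.0

# Conv1D(nf, nx) stores its weight as (nx, nf) and expects nx-dimensional inputs
layer = Conv1D(2 * 768, 768)
print(layer.weight.shape)  # torch.Size([768, 1536])

encoder_hidden_states = torch.randn(1, 10, 1024)  # from bert-large-uncased
# layer(encoder_hidden_states) would raise a shape error:
# the inputs have 1024 features, but the weight expects 768
```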
Thank you for your help @NielsRogge! Actually, I know the error is caused by the dimension. The EncoderDecoderModel docs say: "The EncoderDecoderModel can be used to initialize a sequence-to-sequence model with any pretrained autoencoding model as the encoder and any pretrained autoregressive model as the decoder." So I thought it would deal with the dimension-matching problem automatically. I will follow your suggestions to modify my code and close the issue. Thank you!
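(For reference, a quick way to confirm such a mismatch up front; a small sketch using `AutoConfig`:)

```python
from transformers import AutoConfig

# BERT exposes hidden_size, GPT-2 exposes n_embd
enc_config = AutoConfig.from_pretrained('bert-large-uncased')
dec_config = AutoConfig.from_pretrained('gpt2')
print(enc_config.hidden_size, dec_config.n_embd)  # 1024 vs. 768 -> mismatch
```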
Hi, thanks for your answer @NielsRogge. I am trying to do the same for a gpt2 model with `n_embd` = 1280, also using bert-large-uncased as the encoder with `hidden_size` = 1024.
I saved my model and now load it with:

```python
model = AutoModelForSeq2SeqLM.from_pretrained(...)
```

When I started to fine-tune my model, I hit the same error the OP reported.
I followed your advice afterwards, but this resulted in:

```
size mismatch for decoder.transformer.h.0.crossattention.c_attn.weight: copying a param with shape torch.Size([1280, 2560]) from checkpoint, the shape in current model is torch.Size([1024, 2048]).
```

(many more of these lines with h.x increasing; removed for readability)
Am I missing something here? It looks like the model does not accept the new dimension. Could you give me advice on how to solve this, and perhaps point out what I am missing?
Thanks a lot!
Well, as I wrote my comment, the solution already came to mind: after changing the code you need to recreate the model. As the docs say, "Cross-attention layers are automatically added to the decoder and should be fine-tuned on a downstream generative task, like summarization", so the cross-attention layers are added at model creation. Don't try to fit randomly initialized weights into a wrong shape ;) (see the sketch below)
Sorry for taking your time!
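A minimal sketch of that recreate-instead-of-reload fix (assuming gpt2-large as the decoder, since that checkpoint has `n_embd` = 1280):

```python
from transformers import EncoderDecoderModel

# After patching the Conv1D dimensions, rebuild the model from the pretrained
# parts so the cross-attention layers are created with the new shapes, instead
# of loading an old checkpoint whose weights no longer fit.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    'bert-large-uncased',  # encoder, hidden_size = 1024
    'gpt2-large',          # decoder, n_embd = 1280 (assumed checkpoint)
)
model.save_pretrained('./bert2gpt2')  # fine-tune from this freshly created model
```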
Environment info
transformers version: 4.6.0

Who can help
@patrickvonplaten, @patil-suraj
Information
Model I am using: EncoderDecoderModel. My code is here:
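(A minimal sketch of the setup, following the standard EncoderDecoderModel pattern; not the exact code from the report:)

```python
from transformers import BertTokenizer, EncoderDecoderModel

# Hypothetical reconstruction: bert-large encoder paired with a gpt2 decoder
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-large-uncased', 'gpt2')

inputs = tokenizer('Hello world', return_tensors='pt')
# The cross-attention dimension mismatch surfaces during the forward pass
outputs = model(input_ids=inputs.input_ids, decoder_input_ids=inputs.input_ids)
```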
I got an error like this:

But when I change 'bert-large-uncased' to 'bert-base-uncased', the code runs normally. Can you help me? @patrickvonplaten, @patil-suraj, @LysandreJik