Closed: cgr71ii closed this issue 2 months ago
Hi @cgr71ii 👋 Thank you for opening this issue 🤗
As shown in our documentation, the output of `generate` is different from the output of `forward`. Namely, `generate`'s attention output is a tuple where each item is the attention output of one forward pass. In your example, if you replace e.g. `translated_tokens.decoder_attentions` with `translated_tokens.decoder_attentions[0]`, you'll obtain the results you were expecting :)
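For illustration, a minimal sketch of that nesting (the checkpoint and input sentence below are assumptions for the example, not taken from the thread):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"  # assumed checkpoint, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Hello world!", return_tensors="pt")
out = model.generate(
    **inputs,
    output_attentions=True,
    return_dict_in_generate=True,
    num_beams=1,
    max_new_tokens=5,
)

# out.decoder_attentions is a tuple with one entry per generated token;
# each entry is itself a tuple with one tensor per decoder layer.
print(len(out.decoder_attentions))      # number of generation steps
first_step = out.decoder_attentions[0]  # per-layer tuple for the first forward pass
print(first_step[0].shape)              # layer 0: (batch, heads, 1, 1) with the default cache
```

Indexing with `[0]` therefore yields the same per-layer tuple structure that a single `forward` call returns.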
Oh, ok! Thank you! :)
System Info

`transformers` version: 4.43.3

Who can help?

@gante

Information

Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Hi!
If you set `trigger_error` to `True`, you will see the difference in the decoder-attention shape (and also in the cross-attention shape) depending on whether the translation is produced by `model.generate()` or by `model()`. I don't know if this is a bug or just expected to be different. I have checked that the attention values are the same once the outputs are structured the same way (there are small precision differences, though, which I think is because `model.generate()` generates differently than `model()` does).
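The snippet referred to here isn't shown above; below is a hypothetical reconstruction of the comparison it describes (the checkpoint and input sentence are assumptions; `trigger_error` and `translated_tokens` are the names used in the report):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

trigger_error = True  # the flag mentioned in the report

model_name = "Helsinki-NLP/opus-mt-en-de"  # assumed checkpoint, not from the report
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("This is a test.", return_tensors="pt")
translated_tokens = model.generate(
    **inputs, output_attentions=True, return_dict_in_generate=True, num_beams=1
)

# Re-run the generated sequence through a single forward pass for comparison.
with torch.no_grad():
    forward_out = model(
        **inputs,
        decoder_input_ids=translated_tokens.sequences,
        output_attentions=True,
    )

if trigger_error:
    # generate(): tuple over generation steps, each a tuple over layers
    print(translated_tokens.decoder_attentions[0][0].shape)  # (batch, heads, 1, 1)
    # forward(): tuple over layers, each (batch, heads, tgt_len, tgt_len)
    print(forward_out.decoder_attentions[0].shape)
```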
Expected behavior

I would expect the decoder-attention and cross-attention shapes to have the same format regardless of whether I use `model.generate()` or `model()`. Specifically, I would expect to obtain the result from `model()`, where for the decoder we obtain one matrix per layer with shape (batch_size, attention_heads, generated_tokens - 1, generated_tokens - 1).