Closed: meguruin closed this issue 2 years ago
For ONNX related issues, gently pinging @lewtun here
@patrickvonplaten, thanks. I think this issue is not only about ONNX, since it is a contradiction between the documentation and the actual outputs.
Ok, I see, so the problem is the documentation here? There might very well be a bug in the documentation... The following looks correct to me:
print(len(decoder_outputs["past_key_values"]))
# 12
print(len(decoder_outputs["past_key_values"][0]))
# 4
print(decoder_outputs["past_key_values"][0][0].shape)
# torch.Size([1, 16, 4, 64])
We have 12 layers. ProphetNet is an encoder-decoder model, so it caches 4 tensors per layer (the decoder key and decoder value, as well as the projected encoder key and projected encoder value matrices for the cross-attention layer).
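As a minimal sketch of that layout (reusing decoder_outputs from the snippet above; the (self-attention key, self-attention value, cross-attention key, cross-attention value) ordering is the usual encoder-decoder cache convention and is an assumption here, not something stated by the docstring):
# Unpack the cache layer by layer; decoder_outputs is assumed to come from a
# decoder forward pass with use_cache=True, as in the snippet above.
for layer_idx, layer_cache in enumerate(decoder_outputs["past_key_values"]):
    self_attn_key, self_attn_value, cross_attn_key, cross_attn_value = layer_cache
    print(layer_idx, tuple(self_attn_key.shape))   # (batch, num_heads, decoder_seq_len, head_dim)
    print(layer_idx, tuple(cross_attn_key.shape))  # (batch, num_heads, encoder_seq_len, head_dim)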
Would you maybe like to open a PR to fix the documentation, since it seems you've found the bug?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Environment info
transformers version: 4.15.0

Who can help
@patrickvonplaten, @LysandreJik
Information
Model I am using (Bert, XLNet ...): ProphetNet
The problem arises when using:
When I try to convert ProphetNetModel to ONNX, I found that the "past_key_values" of the decoder output does not have the same shape as in the official documentation. The description of ProphetNetDecoderModelOutput says that each element should have shape (2, batch_size, num_attn_heads, decoder_sequence_length, embed_size_per_head) (see "Expected behavior" below). However, I get past_key_values laid out just like in BaseModelOutputWithPastAndCrossAttentions.

The tasks I am working on is:
To reproduce
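A minimal sketch that reproduces the observation, assuming the microsoft/prophetnet-large-uncased checkpoint and arbitrary example sentences (the sequence lengths in the printed shapes depend on these inputs):
from transformers import ProphetNetModel, ProphetNetTokenizer

tokenizer = ProphetNetTokenizer.from_pretrained("microsoft/prophetnet-large-uncased")
model = ProphetNetModel.from_pretrained("microsoft/prophetnet-large-uncased")

encoder_inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
decoder_inputs = tokenizer("Hello", return_tensors="pt")

outputs = model(
    input_ids=encoder_inputs.input_ids,
    attention_mask=encoder_inputs.attention_mask,
    decoder_input_ids=decoder_inputs.input_ids,
    use_cache=True,
)

print(len(outputs.past_key_values))         # 12: one entry per decoder layer
print(len(outputs.past_key_values[0]))      # 4: a tuple of tensors, not a single stacked tensor
print(outputs.past_key_values[0][0].shape)  # (batch_size, num_attn_heads, decoder_seq_len, head_dim)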
Expected behavior
len(decoder_outputs["past_key_values"]) == 12
and
decoder_outputs["past_key_values"][0].shape == (2, batch_size, num_attn_heads, decoder_sequence_length, embed_size_per_head)
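For contrast, a quick check along these lines (reusing outputs from the sketch above) shows the cache is a tuple of 4 tensors per layer rather than the single stacked tensor the docstring describes:
import torch

layer_cache = outputs.past_key_values[0]

print(torch.is_tensor(layer_cache))  # False: a tuple of tensors, not one stacked tensor
print(len(layer_cache))              # 4: self-attn key/value plus cross-attn key/value
print(layer_cache[0].dim())          # 4-D: (batch_size, num_attn_heads, sequence_length, embed_size_per_head)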