Closed godjw closed 2 months ago
Hi @godjw and thanks for the issue. Indeed, it seems (at least from my local setup) that left padding has issues in float16 precision; does that match your finetuning setup?
The issue does seem to arise since v4.40, where we introduced RecurrentGemma, despite having a test in modeling that checks left-padded generation. ccing @ArthurZucker for vis and will take a look tomorrow!
@molbap
Thanks for the reply! I am using float32 precision for my finetuning setup, but I don't think the precision should make much difference for the padding mask.
Currently, finetuning only works with right padding.
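Since the distinction between the two padding sides keeps coming up, here is a minimal, self-contained sketch of how left vs. right padding changes the attention mask (the token ids are made up and `pad_id = 0` is an assumption; real tokenizers do this via `tokenizer.padding_side`):

```python
# Build left- and right-padded batches by hand to show how the attention
# mask differs: 0 marks padding positions, 1 marks real tokens.
pad_id = 0
seqs = [[5, 6, 7], [8, 9]]  # hypothetical token ids
max_len = max(len(s) for s in seqs)

def pad(seq, side):
    n = max_len - len(seq)
    if side == "left":
        return [pad_id] * n + seq, [0] * n + [1] * len(seq)
    return seq + [pad_id] * n, [1] * len(seq) + [0] * n

left = [pad(s, "left") for s in seqs]    # padding before the tokens
right = [pad(s, "right") for s in seqs]  # padding after the tokens
```

With left padding, the pad positions sit at the start of the sequence, so the first rows of the resulting causal mask end up fully masked, which is exactly where the trouble starts.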
This seems expected in terms of mask creation: the fully padded rows become `-3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38`, but this is not supported in sdpa, so we switch those rows to full attention, `-0.0000e+00, -0.0000e+00, -0.0000e+00, -0.0000e+00`, which should not affect the result. If you see this with `output_attention=True` when training / not using sdpa as the `attn_implementation`, then this is not expected. Now, the causal mask is not going to be an issue TBH, but the way the model works might be. Since there is a convolution layer in the RNN part, we might need to make it ignore padding. One more thing is which padding token is used; if the embedding was resized for it, the new embedding value needs to be an average of all the model's embeddings.
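For the last point, here is a minimal sketch (not the actual `resize_token_embeddings` implementation, just the idea) of initializing a newly added embedding row with the mean of the existing embeddings:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: an existing 10-token embedding table resized to 11
# to make room for a new pad token. The new row is initialized with the
# mean of all existing embeddings, as suggested above.
old = nn.Embedding(10, 4)
new = nn.Embedding(11, 4)
with torch.no_grad():
    new.weight[:10] = old.weight            # copy over existing rows
    new.weight[10] = old.weight.mean(dim=0)  # mean-initialize the new row
```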
@ArthurZucker
I'm sorry, but I'm having some difficulty understanding a few points. Could you please clarify why the first two rows becoming `-3.4028e+38, -3.4028e+38, -3.4028e+38, -3.4028e+38` would not work with sdpa, and why switching to full attention wouldn't affect the result? It seems quite unusual to me that the padding tokens would be attended to.
I was able to train and run inference by giving the model no attention mask and just using the default causal mask, which ignores the padding tokens. With the left-padded attention mask, only training was possible and the inference results were really bad.
Additionally, I wanted to mention that I am using sdpa because it is the default and only supported attention implementation in `RecurrentGemmaModel`.
Thank you very much for your help!
This piece of code in `modeling_llama.py` should help you:
```python
# Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
# using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
# Details: https://github.com/pytorch/pytorch/issues/110213
causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
```
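To see why fully masked rows are a problem for `F.scaled_dot_product_attention` in the first place, here is a small standalone demonstration (random tensors, shapes chosen arbitrarily): a row that attends to nothing produces NaN outputs, and "unmasking" that row avoids the NaN, which is harmless because pad positions are discarded downstream anyway.

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 1, 3, 4)
k = torch.randn(1, 1, 3, 4)
v = torch.randn(1, 1, 3, 4)

# Boolean mask, True = attend. Row 0 (e.g. a left-pad position) attends
# to nothing, which makes its softmax a 0/0 and yields NaN.
mask = torch.tensor([[False, False, False],
                     [True,  False, False],
                     [True,  True,  False]])
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# The fix mirrors _unmask_unattended: let fully masked rows attend to
# everything instead; their output is garbage but finite and unused.
fixed_mask = mask.clone()
fixed_mask[mask.logical_not().all(dim=-1)] = True
out_fixed = F.scaled_dot_product_attention(q, k, v, attn_mask=fixed_mask)
```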
🤗
Thank you very much! Then the problem could be that the padding tokens are not handled properly, as you mentioned.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
`transformers` version: 4.40.0

Who can help?
@ArthurZucker @younesbelkada

Information
Tasks
An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
I am currently trying to finetune `RecurrentGemmaModel`. However, when I trained the model with left padding, the training outputs were quite strange, so I tried to debug it.
I checked this part about the causal mask: https://github.com/huggingface/transformers/blob/96eb06286b63c9c93334d507e632c175d6ba8b28/src/transformers/models/recurrent_gemma/modeling_recurrent_gemma.py#L753-L779
When I pass a left-padded dummy attention mask like the one below, the resulting attention mask looks strange. However, a right-padded attention mask produces the correct attention mask.
Expected behavior
I think the expected first mask, which is left padded, should look like the one below, because the padded parts should not be attended to.
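For reference, here is a rough, simplified sketch of how a causal mask is combined with a left-padded 2D attention mask (this is not the actual RecurrentGemma code, just the general pattern: masked slots are filled with the dtype minimum, which is where the `-3.4028e+38` values above come from):

```python
import torch

min_dtype = torch.finfo(torch.float32).min

# Hypothetical batch of 1 with two left-pad tokens followed by two real tokens.
attn_mask = torch.tensor([[0, 0, 1, 1]])
L = attn_mask.shape[1]

# Standard causal mask: strictly upper-triangular entries are masked out.
causal = torch.triu(torch.full((L, L), min_dtype), diagonal=1)

# Mask out the padded key columns as well; the pad rows end up fully masked,
# which is the case sdpa cannot handle without the unmasking workaround.
combined = causal.masked_fill(attn_mask[:, None, :] == 0, min_dtype)
```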