Hi @chanind, thanks for reporting the issue!

This is indeed a problem with scaled_dot_product_attention in PyTorch. The cause of the NaNs is how softmax is computed over fully-masked rows in the attention mask; I hope it will be fixed in future versions of PyTorch, here is a related PR. Also, a similar issue has been reported previously.
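For illustration, here is that failure mode in isolation (a minimal standalone snippet, not taken from the linked PR): softmax over a row where every position is masked out has no valid maximum and produces NaN.

import torch

# A fully masked attention row: every key position is disallowed.
row = torch.full((1, 4), float("-inf"))
print(torch.softmax(row, dim=-1))  # tensor([[nan, nan, nan, nan]])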
Besides switching to eager/flash_attention_2, you could also try:

1. Use float16 dtype.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b", device_map="auto", torch_dtype=torch.float16
)
2. Modify the attn_mask min value.

As suggested in the issue above, we can modify attn_mask to use another min value instead of torch.finfo(dtype).min, for example torch.finfo(dtype).min / 2. To apply this, find min_dtype = torch.finfo(dtype).min in the Gemma modeling file and replace it with torch.finfo(dtype).min / 2.
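A rough illustration of why halving the value can help (hypothetical float16 numbers, not taken from the modeling code): adding anything to torch.finfo(dtype).min can overflow to -inf, which is what pushes the softmax into NaN territory, while min / 2 leaves headroom so the masked scores stay finite.

import torch

dtype = torch.float16
score = torch.tensor(-10.0, dtype=dtype)

print(score + torch.finfo(dtype).min)      # -inf: the addition overflows float16
print(score + torch.finfo(dtype).min / 2)  # finite, yet still negative enough to mask the position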
Meanwhile, we will try to fix it on our side, thanks!
On top of this, it's expected, as the sdpa path does not support logit soft-capping (for Gemma 2).
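For context, Gemma 2's attention logit soft-capping is essentially a scaled tanh applied to the raw attention logits on the eager path. A rough sketch, with softcap standing in for the model's attn_logit_softcapping config value:

import torch

def soft_cap(attn_logits: torch.Tensor, softcap: float) -> torch.Tensor:
    # Squash the raw attention logits into the range (-softcap, softcap).
    return torch.tanh(attn_logits / softcap) * softcap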
We do already take the sdpa bug into account when creating the mask, @qubvel, see here: https://github.com/huggingface/transformers/blob/c1aa0edb48217f416f4bbe6e3a9db1500284513b/src/transformers/models/llama/modeling_llama.py#L1063-L1072
This should be propagated to Gemma2 (it was not there for some reason, my bad).
Related to #31303
@ArthurZucker thanks for the updated info!
Hi, I have met a problem: when I fine-tune Gemma2-2b using transformers.Trainer, I find the lr is always 0 and grad_norm is NaN. What's wrong? I'm using the same code to fine-tune Llama3-8B and it works well. These are my settings:
Same issue here running the code for hooking the activations of the model. Using float16 made it work.
Hey! Make sure you are using eager or flash_attention_2, not sdpa!
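For example (a sketch; flash_attention_2 additionally requires the flash-attn package and a float16/bfloat16 dtype):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b",
    device_map="auto",
    torch_dtype=torch.bfloat16,               # flash_attention_2 needs fp16/bf16
    attn_implementation="flash_attention_2",  # or "eager"
)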
Hi, I have the same issue. How did you solve it? 😊
Hi, I just use eager instead of sdpa, like this:

model = AutoModelForCausalLM.from_pretrained(
    args.prune_model_path,
    trust_remote_code=True,
    device_map=device_map,
    attn_implementation="eager",
)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Python 3.10
Transformers 4.43.3
Linux (Colab notebook)
Who can help?
@ArthurZucker
Reproduction
The default Gemma 2 2B attention implementation results in NaN for padding tokens. A simple demo can be seen below (also reproduced in this Colab notebook):
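The original notebook snippet is not shown here; a minimal sketch of that kind of demo (illustrative only, assuming left padding so the pad-token rows end up fully masked) looks roughly like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b", device_map="auto"
)  # default attn_implementation (sdpa)

tokenizer.padding_side = "left"  # pad positions get fully masked attention rows
inputs = tokenizer(["Hello world", "Hi"], return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits
print(torch.isnan(logits).any())  # may print True with sdpa on the affected versions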
This returns NaN values in the logits at the padding token positions.
This can be fixed by changing the attn_implementation to anything except sdpa.
Expected behavior
Using padding should not result in NaN for normal inputs to Gemma 2 2B.