chanind opened this issue 1 month ago (status: Open)
Hi @chanind, thanks for reporting the issue!
This is indeed a problem with scaled_dot_product_attention in PyTorch. The cause of the NaN is how softmax is computed over fully-masked rows in the attention mask, and I hope it will be fixed in future versions of PyTorch; here is a related PR. Also, a similar issue has been reported previously.
Besides switching to eager/flash_attention_2, you could also try the following (see the short sketch after the two suggestions):
1. Use float16 dtype.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b", device_map="auto", torch_dtype=torch.float16
)
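A quick sanity check, assuming a padded batch along the lines of the original report (the prompts below are placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
tokenizer.padding_side = "left"  # left padding for batched generation
# One shorter prompt in the batch so that padding tokens are actually used.
inputs = tokenizer(["Hello, how are you?", "Hi"], padding=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(out, skip_special_tokens=True))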
2. Modify the attn_mask min value. As suggested in the issue above, we can modify attn_mask to use a different min value instead of torch.finfo(dtype).min, for example torch.finfo(dtype).min / 2. To apply this, find min_dtype = torch.finfo(dtype).min in the Gemma modeling file and replace it with torch.finfo(dtype).min / 2.
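For intuition only (a standalone sketch, not the modeling code itself): softmax over a row that ends up entirely -inf is NaN, while a large-but-finite mask value keeps it defined, which is why halving the min value helps:

import torch

dtype = torch.float16

# A fully-masked attention row. With torch.finfo(dtype).min as the mask value, adding the
# attention scores can overflow to -inf in half precision, and softmax over an all -inf row is NaN.
row = torch.full((4,), float("-inf"), dtype=dtype)
print(torch.softmax(row, dim=-1))  # tensor([nan, nan, nan, nan])

# With torch.finfo(dtype).min / 2 the values stay finite and softmax is well defined
# (uniform over the fully-masked row, which downstream code can then ignore).
row_half = torch.full((4,), torch.finfo(dtype).min / 2, dtype=dtype)
print(torch.softmax(row_half, dim=-1))  # tensor([0.25, 0.25, 0.25, 0.25])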
Meanwhile, we will try to fix it on our side, thanks!
More than this, it's expected, as the sdpa path does not support logit soft-capping (for Gemma 2).
We do already take the sdpa bug into account when creating the mask, @qubvel, see here: https://github.com/huggingface/transformers/blob/c1aa0edb48217f416f4bbe6e3a9db1500284513b/src/transformers/models/llama/modeling_llama.py#L1063-L1072
This should be propagated to Gemma2 (it was missing there for some reason, my bad).
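Roughly, the idea behind that Llama snippet is to re-open rows of the 4D mask that are fully masked, so sdpa's softmax never sees an all-masked row. A minimal sketch with a hypothetical helper (not the actual transformers code):

import torch

def unmask_fully_masked_rows(causal_mask: torch.Tensor, min_dtype: float) -> torch.Tensor:
    # Hypothetical helper: rows where every key position is masked (e.g. queries that are
    # pure padding under left padding) are reset to attend everywhere, so the softmax inside
    # scaled_dot_product_attention never operates on an all -inf row.
    fully_masked = (causal_mask == min_dtype).all(dim=-1, keepdim=True)
    return causal_mask.masked_fill(fully_masked, 0.0)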
Related to #31303
@ArthurZucker thanks for the updated info!
Hi, I have run into a problem: when I fine-tune Gemma2-2b using transformers.Trainer, the lr is always 0 and grad_norm is nan. What's wrong? I use the same code to fine-tune llama3-8b and it works well. These are my settings:
Same issue here when running code to hook the model's activations. Using float16 made it work.
Hey! Make sure you are using eager or flash_attention_2, not sdpa!
Hi, I have run into a problem: when I fine-tune Gemma2-2b using transformers.Trainer, the lr is always 0 and grad_norm is nan. What's wrong? I use the same code to fine-tune llama3-8b and it works well. These are my settings:
Hi, I have the same issue. How did you solve it? 😊
Hi, I just use eager instead of sdpa, like this:

model = AutoModelForCausalLM.from_pretrained(
    args.prune_model_path,
    trust_remote_code=True,
    device_map=device_map,
    attn_implementation="eager",
)
System Info
Python 3.10
Transformers 4.43.3
Linux (Colab notebook)
Who can help?
@ArthurZucker
Information
Tasks
Reproduction
The default Gemma 2 2B attention implementation results in NaN for padding tokens. A simple demo can be seen below (also reproduced in this colab notebook):
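(The original snippet and its output are not reproduced here; the following is a minimal sketch of the setup described, with placeholder prompts.)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b", device_map="auto")

# Two prompts of different lengths, so the shorter one receives padding tokens.
inputs = tokenizer(["Hello, how are you today?", "Hi"], padding=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

print(torch.isnan(logits).any())  # reportedly True with the default sdpa attention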
This returns NaN values in the logits at the padded positions.
This can be fixed by changing the attn_implementation to anything except sdpa.
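For example (eager shown here; flash_attention_2 works as well, per the comments above):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b", device_map="auto", attn_implementation="eager"
)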
Expected behavior
Using padding should not result in NaN for normal inputs to Gemma 2 2B.