Open Leoyzen opened 5 months ago
cc @ArthurZucker @fxmarty
Yep, I think this has always been there, and it is related to the mask creation, which probably overflows. Do you want to open a PR for a fix? 🤗
@Leoyzen can you share the repro tensors and/or reproduction with a transformers example?
The `reproduce_ata.pt` file, which was dumped from the private code repo, is quite large (shape `torch.Size([31, 1, 2000, 2000])`, almost 1 GB).
We use BERT from transformers with weights from [stella-v2](https://huggingface.co/infgrad/stella-large-zh-v2) for some fine-tuning work.
Training with bert large and bfloat16 should reproduce the bug.
Mmmm if this is still a problem, we need to propagate the changes from #32227 to bert and bert sdpa!
Leaving it to the community unless I get time !
System Info
Who can help?
No response
Reproduction
Expected behavior
The output should not contain NaN when using bfloat16 with SDPA enabled.
I think it is safe to use `torch.finfo(dtype).min / 2` instead of `torch.finfo(dtype).min`.
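A minimal sketch of why halving the fill value could help, assuming the NaN comes from two additive masks that both mask the same positions (this is a guess based on the "mask creation that probably overflows" comment, not a confirmed trace through the BERT code). The helper `combined_mask_row` is hypothetical and only simulates that addition:

```python
import torch

dtype = torch.bfloat16
neg = torch.finfo(dtype).min  # most negative finite bfloat16 value


def combined_mask_row(fill_value, n=4):
    # Hypothetical helper: simulates adding two additive masks that both
    # mask every position in a row (e.g. a fully padded row).
    a = torch.full((n,), fill_value, dtype=dtype)
    b = torch.full((n,), fill_value, dtype=dtype)
    return a + b


# min + min exceeds the bfloat16 range and overflows to -inf.
overflowed = combined_mask_row(neg)
# min / 2 + min / 2 lands back on min, which is still finite.
safe = combined_mask_row(neg / 2)

# A row that is entirely -inf has no finite maximum, so softmax yields NaN.
probs_overflowed = torch.softmax(overflowed, dim=-1, dtype=torch.float32)
# A row of equal finite values gives a uniform distribution instead.
probs_safe = torch.softmax(safe, dim=-1, dtype=torch.float32)

print("overflowed mask is -inf:", torch.isinf(overflowed).all().item())
print("softmax has NaN (min):", torch.isnan(probs_overflowed).any().item())
print("softmax has NaN (min / 2):", torch.isnan(probs_safe).any().item())
```

If the real mask path sums more than two masks, `min / 2` alone would not be enough. Clamping the combined mask to `torch.finfo(dtype).min` after the addition would be another option.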