Closed dcaffo98 closed 1 year ago
Hi @dcaffo98, it'd be best to file this directly with DeepSpeed at https://github.com/microsoft/DeepSpeed/issues, since the issue is on the DeepSpeed side.
In general such issues relate to code that changes the model after it was initialized, but there are many complex nuanced situations so it's best to talk to the DS developers directly.
I've filed the issue with the DS team as well. It may be worth noting that the error happens right after the first detected OVERFLOW in the resumed run. However, multiple overflows had already occurred during the previous 24h of training (before resuming from the checkpoint).
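For context on what an OVERFLOW message means in fp16 training: with dynamic loss scaling, a step whose gradients contain inf/NaN is skipped and the loss scale is reduced, and the scale is raised again after a run of clean steps. The following is a simplified sketch of that idea only. `ToyLossScaler` and its fields are hypothetical names for illustration, and this is not DeepSpeed's actual implementation (which, for example, also supports hysteresis).

```python
from dataclasses import dataclass


@dataclass
class ToyLossScaler:
    """Simplified dynamic loss scaler (illustration only, not DeepSpeed's class)."""
    scale: float = 2.0 ** 16
    growth_window: int = 1000   # clean steps required before the scale grows
    factor: float = 2.0         # shrink/grow multiplier
    min_scale: float = 1.0
    clean_steps: int = 0

    def update(self, overflow: bool) -> bool:
        """Record one step; return True if the optimizer step should be applied."""
        if overflow:
            # Gradients contained inf/NaN: skip this step and shrink the scale.
            self.scale = max(self.scale / self.factor, self.min_scale)
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps % self.growth_window == 0:
            # Enough consecutive clean steps: try a larger scale again.
            self.scale *= self.factor
        return True


# Tiny demo: one overflow among four steps, with a short growth window.
scaler = ToyLossScaler(growth_window=2)
applied = [scaler.update(overflow) for overflow in (False, True, False, False)]
```

In this toy model an overflow is a routine, recoverable event, which is consistent with the earlier overflows in the original run not causing a crash.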
System Info
transformers version: 4.27.4
Who can help?
@stas00 may be best suited for this, since the issue is probably related to DeepSpeed.
Reproduction
Currently, I'm struggling to produce a reproducible script, as the error happens suddenly during training with ZeRO stage 3 enabled, and I'm using a custom dataset. The task is contrastive-loss pretraining. The backbone is GLPN's encoder model, followed by a custom Attention Pooling module. The parameters causing the issues
DeepSpeed version is 0.9.1.
The issue may be related to this, although the stack trace is not identical.
The error shows up only when resuming from a checkpoint (resume_from_checkpoint=/path/to/checkpoint). I'm attaching the log output (error.txt), along with the DeepSpeed ZeRO 3 configuration I'm using (config_adam_zero3.txt) and the custom model implementation (modeling_custom_apr.txt).
config_adam_zero3.txt error.txt modeling_custom_apr.txt
This is the last part of the log where the error shows up
Expected behavior
After resuming from a checkpoint, training should proceed normally, as it does when training from scratch with the same setup.
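For readers unfamiliar with the setup described under Reproduction, a ZeRO stage 3 configuration with fp16 dynamic loss scaling might look roughly like the sketch below. This is an assumption for illustration, not the attached config_adam_zero3.txt: every value is a placeholder, and the "auto" entries rely on the Hugging Face Trainer integration filling them in from its own arguments.

```python
import json

# Hypothetical ZeRO stage 3 config (NOT the attached config_adam_zero3.txt).
zero3_config = {
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,             # 0 selects dynamic loss scaling
        "initial_scale_power": 16,   # initial scale is 2**16
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto",
        },
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# Serialize to the JSON text that would be saved as the DeepSpeed config file.
config_text = json.dumps(zero3_config, indent=2)
```

With the Trainer, the saved file's path would typically be passed through the `deepspeed` argument of `TrainingArguments`, and resuming would be requested with `trainer.train(resume_from_checkpoint="/path/to/checkpoint")`. The reported failure occurs only on that resume path.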