LoggerHead22 opened this issue 7 months ago
@LoggerHead22, we will look into this issue. As an alternative (stopgap measure), please consider using the hpZ component of ZeRO++.
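For reference, here is a rough sketch of what enabling hpZ (hierarchical partitioning of secondary weights) looks like in the ZeRO stage 3 section of the DeepSpeed config, shown as a Python dict so it can be annotated; the `ds_config` name and the partition size value are illustrative, with the partition size usually set to the number of GPUs per node:

```python
# Sketch of enabling ZeRO++ hpZ in the ZeRO stage 3 config.
# Only the zero_optimization section is shown; the rest of the config is omitted.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "zero_hpz_partition_size": 8,  # illustrative: assumes 8 GPUs per node
    },
}
```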
Is there any update on this, @samadejacobs?
@samadejacobs, also curious about an update. I'm seeing the same issue with PyTorch 2.2 + CUDA 12 on NVIDIA GPUs.
+1. Same issue with PyTorch 2.4 and CUDA 12.6 on a p4d.24xlarge.
Describe the bug
Hi, I'm trying to run GPT model pretraining with the Megatron-DeepSpeed pipeline and the ZeRO-3 + MiCS sharding strategy, but I get the following log:
If I split the model across 2 nodes ("mics_shard_size": 16) and set "mics_hierarchical_params_gather": true, this error appears explicitly, without a warning:
Although training formally continues in the first case, within the first 10-20 iterations the loss scaler catches a large number of overflows and the model stops learning normally. By contrast, pure ZeRO-3 trains without errors, overflows, or other problems. The error occurs on any number of nodes, even on a single node.
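To make the setup concrete, here is a minimal sketch of the zero_optimization section I am describing, written as a Python dict for annotation; only the stage and the two MiCS fields mentioned above are taken from my setup, everything else in my config file is omitted:

```python
# Sketch of the ZeRO-3 + MiCS part of the DeepSpeed config described above.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "mics_shard_size": 16,                    # model sharded across 16 GPUs (2 nodes)
        "mics_hierarchical_params_gather": True,  # enabling this triggers the explicit error
    },
}
```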
I am using my own fork of the Megatron-DeepSpeed framework with minimal changes to run with MiCS, which unfortunately I cannot share. However, I am confident the problem is not in the training code, because all other ZeRO modes work correctly.
My DeepSpeed config:
ds_report output:
System info:
Launcher context
I'm launching my experiment with torchrun.
Can someone suggest a reason for this behavior? Judging by the existing issues, it is very rare. Is this a problem with the MiCS logic, with my environment, or with something else?