microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Problems with MiCS training #5080

Open LoggerHead22 opened 7 months ago

LoggerHead22 commented 7 months ago

Describe the bug
Hi, I'm trying to run pretraining of a GPT model with the Megatron-DeepSpeed pipeline and the ZeRO-3 + MiCS sharding strategy, but I get the following log:

WARNING: Runtime Error while waiting the collective all-gather, possibly due to the _IllegalWork
[2024-02-02 16:28:29,946] [INFO] [logging.py:96:log_dist] [Rank 0] Error message: Illegal to call wait on IllegalWork object

If I split the model across 2 nodes ("mics_shard_size": 16) and set "mics_hierarchical_params_gather": true, this error is raised explicitly rather than as a warning:

File "/usr/local/lib64/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1650, in __getattribute__
  raise RuntimeError(f"Illegal to call {name} on IllegalWork object")
RuntimeError: Illegal to call wait on IllegalWork object
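
My reading (an assumption, not a confirmed diagnosis): in torch.distributed, collectives issued inside the private _coalescing_manager context return an _IllegalWork placeholder whose methods all raise, and only the coalescing manager itself can be waited on; the MiCS hierarchical all-gather apparently ends up calling wait() on such a placeholder. The sketch below only reproduces the exception text, using the private _IllegalWork class:

# Sketch reproducing the exception text above via the private _IllegalWork
# class; it demonstrates the placeholder's behavior only and is not a
# diagnosis of the MiCS code path.
from torch.distributed.distributed_c10d import _IllegalWork

handle = _IllegalWork()  # placeholder returned by collectives issued inside a coalescing context
try:
    handle.wait()
except RuntimeError as e:
    print(e)  # Illegal to call wait on IllegalWork object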

Although training formally continues in the first case, within the first 10-20 iterations the loss scaler reports many overflows and the model stops learning normally. By contrast, pure ZeRO-3 trains without errors, overflows, or other problems. The error occurs on any number of nodes, even on a single node.

I am using my own fork of the Megatron-DeepSpeed framework with minimal changes to run with MiCS, which unfortunately I cannot share. But I am confident the training code is not the problem, because all other ZeRO modes work correctly.

My DeepSpeed config:

{
  "train_batch_size" : 8,
  "train_micro_batch_size_per_gpu": 1536,
  "steps_per_print": 10,
  "zero_optimization": {
    "stage": 3,
    "reduce_scatter" : true,
    "overlap_comm": true,
    "allgather_partitions" : true,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size" : 5e8,
    "stage3_max_live_parameters" : 1e9,
    "stage3_prefetch_bucket_size" : 5e8,
    "stage3_max_reuse_distance" : 1e9,
    "stage3_param_persistence_threshold": 1e6,
    "mics_shard_size": 8,
    "mics_hierarchical_params_gather": false
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 12,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "comms_logger": {
    "enabled": true,
    "verbose": true,
    "prof_all": true,
    "debug": false
  },
  "gradient_clipping" : 1.0,
  "wall_clock_breakdown" : true
}
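
I can't share the fork itself, but the MiCS wiring is roughly the sketch below (a placeholder model stands in for the actual Megatron GPT model, and the config path is illustrative); the model is built under deepspeed.zero.MiCS_Init instead of deepspeed.zero.Init so that parameters are sharded over sub-groups of "mics_shard_size" ranks:

# Sketch of the MiCS setup (placeholder model; not the actual Megatron-DeepSpeed
# fork). The model is constructed under deepspeed.zero.MiCS_Init so parameters
# are sharded over sub-groups of "mics_shard_size" ranks; everything else goes
# through deepspeed.initialize. Launched with torchrun, one process per GPU.
import deepspeed
import torch

ds_config = "ds_config.json"  # the JSON config shown above (illustrative path)

with deepspeed.zero.MiCS_Init(config_dict_or_path=ds_config):
    model = torch.nn.Linear(4096, 4096)  # stand-in for the GPT model

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)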

ds_report output

DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib64/python3.9/site-packages/torch']
torch version .................... 2.1.0+rocm5.6
deepspeed install path ........... ['/usr/local/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.13.1, unknown, unknown
torch cuda version ............... None
torch hip version ................ 5.6.31061-8c743ae5d
nvcc version ..................... None
deepspeed wheel compiled w. ...... torch 2.1, hip 5.6

System info:

Launcher context
I'm launching my experiment with torchrun.

Can someone suggest a reason for this behavior? Judging by the existing issues, it seems to be very rare. Is this a problem in the MiCS logic, in my environment, or something else?

samadejacobs commented 7 months ago

@LoggerHead22, we will look into this issue. As an alternative (stopgap measure), please consider using the hpZ component of ZeRO++.
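
For reference, a sketch of the hpZ-related knobs (illustrative values only, following the ZeRO++ tutorial; not a drop-in replacement for the config above), written here as a Python config dict:

# Sketch of a ZeRO++ hpZ configuration (illustrative values).
# zero_hpz_partition_size sets the size of the secondary weight-sharding
# group, typically the number of GPUs per node; the ZeRO++ quantization
# features are optional and left disabled here.
zero_pp_config = {
    "zero_optimization": {
        "stage": 3,
        "reduce_bucket_size": 5e8,
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_param_persistence_threshold": 1e6,
        "zero_hpz_partition_size": 8,        # hpZ: secondary shard group size
        "zero_quantized_weights": False,     # qwZ (optional)
        "zero_quantized_gradients": False,   # qgZ (optional)
    }
}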

zijwang commented 5 months ago

Is there any update on this, @samadejacobs ?

dementrock commented 3 months ago

@samadejacobs also curious about an update; I'm seeing the same issue with PyTorch 2.2 + CUDA 12 + NVIDIA GPUs.

evkogs commented 1 month ago

+1, same issue with PyTorch 2.4 and CUDA 12.6 on a p4d.24xlarge.