Describe the bug
When using pipelining (with or without LayerSpec inside PipelineModule), the first GPU shows considerably higher memory consumption than the other GPUs. This is visible even on a perfectly balanced ML model (like the one I attach below).
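For reference, a minimal sketch of how such a balanced stack of linear layers can be declared with LayerSpec inside PipelineModule (this is not the attached benchmark.py; the hidden size, loss function, and partition method are illustrative assumptions):

```python
# Minimal sketch, NOT the attached benchmark.py: a perfectly balanced stack of
# identical linear layers declared lazily via LayerSpec inside PipelineModule.
# HIDDEN, the loss function, and partition_method are illustrative assumptions.
import torch
import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec

HIDDEN = 128      # placeholder layer width
N_LAYERS = 2048   # matches the 2048 simple linear layers described above

# Expected to run under the deepspeed launcher, which provides the process group env.
deepspeed.init_distributed()

layers = [LayerSpec(torch.nn.Linear, HIDDEN, HIDDEN) for _ in range(N_LAYERS)]

model = PipelineModule(
    layers=layers,
    num_stages=8,                 # e.g. --pipeline_num_stages 8
    loss_fn=torch.nn.MSELoss(),   # placeholder loss
    partition_method="uniform",   # identical number of layers per stage
)
```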
To Reproduce
extract the train loop in train.py, the ML model and dataset in benchmark.py, and the DeepSpeed config in ds_config.json, all zipped inside the attached code.zip;
the model consists of 2048 very simple, perfectly balanced linear layers;
run deepspeed --num_gpus=8 train.py --deepspeed --deepspeed_config ds_config.json --pipeline_num_stages 8 --pipeline_spec_layers
this runs the memory-efficient (LayerSpec-based) pipeline implementation. Remove --pipeline_spec_layers to run the non-LayerSpec implementation; the issue is still visible;
launch with --pipeline_num_stages X to set the number of stages X $\in \{2, 4, 8\}$; the issue is still visible;
on a different terminal, run watch nvidia-smi and check the memory usage across GPUs when training starts (an optional per-rank logging sketch follows this list).
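Optionally, per-rank memory can also be logged from inside the training loop to complement watch nvidia-smi. This is a hedged sketch; the helper name and call site are not part of the attached train.py:

```python
# Hedged sketch: log per-rank CUDA memory from inside the training loop.
# The helper name and message format are illustrative, not part of train.py.
import torch
import torch.distributed as dist

def log_gpu_memory(tag: str = "") -> None:
    """Print current and peak CUDA memory for this rank (one line per GPU)."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    allocated = torch.cuda.memory_allocated() / 2**30    # GiB currently held by tensors
    reserved = torch.cuda.max_memory_reserved() / 2**30  # GiB peak reserved by the allocator
    print(f"[rank {rank}] {tag}: allocated={allocated:.2f} GiB, peak_reserved={reserved:.2f} GiB")

# e.g. call log_gpu_memory("after step") once per training step in train.py
```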
Expected behaviour
Running nvidia-smi should output the per-GPU memory usage values; a big difference between GPU 0 and the other GPUs is visible.
ds_report output
[2023-10-09 10:53:47,040] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
transformer_inference .. [NO] ....... [NO]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['~/.local/lib/python3.10/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['~/.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.3, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 12.2
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
shared memory (/dev/shm) size .... 125.89 GB
System info:
Ubuntu 20.04.6 LTS
deepspeed==0.10.3, torch==2.0.1, and torch.version.cuda==11.7
A single compute node with 8x NVIDIA GeForce GTX TITAN X GPUs.