Describe the bug
When using pipelining (with or without LayerSpec inside PipelineModule), the first GPU shows considerably higher memory consumption than the other GPUs. This is visible even on a perfectly balanced ML model (like the one I attach below).
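For reference, a minimal sketch of how such a balanced stack of linear layers can be declared with LayerSpec inside PipelineModule (this is not the attached benchmark.py; the hidden size, loss function, and partition method are illustrative assumptions):

```python
# Minimal sketch, NOT the attached benchmark.py: a perfectly balanced stack of
# identical linear layers declared lazily via LayerSpec inside PipelineModule.
# HIDDEN, the loss function, and partition_method are illustrative assumptions.
import torch
import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec

HIDDEN = 128      # placeholder layer width
N_LAYERS = 2048   # matches the 2048 simple linear layers described above

# Expected to run under the deepspeed launcher, which provides the process group env.
deepspeed.init_distributed()

layers = [LayerSpec(torch.nn.Linear, HIDDEN, HIDDEN) for _ in range(N_LAYERS)]

model = PipelineModule(
    layers=layers,
    num_stages=8,                 # e.g. --pipeline_num_stages 8
    loss_fn=torch.nn.MSELoss(),   # placeholder loss
    partition_method="uniform",   # identical number of layers per stage
)
```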
To Reproduce
extract the train loop in train.py, the ML model and dataset in benchmark.py, and the DeepSpeed config in ds_config.json, all zipped inside the attached code.zip;
the model consists of 2048 very simple, perfectly balanced linear layers;
run deepspeed --num_gpus=8 train.py --deepspeed --deepspeed_config ds_config.json --pipeline_num_stages 8 --pipeline_spec_layers
this runs the memory-efficient (LayerSpec-based) pipeline implementation. Remove --pipeline_spec_layers to run the non-LayerSpec implementation; the issue is still visible;
launch with --pipeline_num_stages X to set the number of stages X $\in \{2, 4, 8\}$; the issue is still visible;
on a different terminal, run watch nvidia-smi and check the memory usage across GPUs when training starts (an optional per-rank logging sketch follows this list).
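Optionally, per-rank memory can also be logged from inside the training loop to complement watch nvidia-smi. This is a hedged sketch; the helper name and call site are not part of the attached train.py:

```python
# Hedged sketch: log per-rank CUDA memory from inside the training loop.
# The helper name and message format are illustrative, not part of train.py.
import torch
import torch.distributed as dist

def log_gpu_memory(tag: str = "") -> None:
    """Print current and peak CUDA memory for this rank (one line per GPU)."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    allocated = torch.cuda.memory_allocated() / 2**30    # GiB currently held by tensors
    reserved = torch.cuda.max_memory_reserved() / 2**30  # GiB peak reserved by the allocator
    print(f"[rank {rank}] {tag}: allocated={allocated:.2f} GiB, peak_reserved={reserved:.2f} GiB")

# e.g. call log_gpu_memory("after step") once per training step in train.py
```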
Expected behaviour
Running nvidia-smi should output the per-GPU memory usage values; a big difference between GPU 0 and the other GPUs is visible.
ds_report output
[2023-10-09 10:53:47,040] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
transformer_inference .. [NO] ....... [NO]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['~/.local/lib/python3.10/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['~/.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.3, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 12.2
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
shared memory (/dev/shm) size .... 125.89 GB
System info:
Ubuntu 20.04.6 LTS
deepspeed==0.10.3, torch==2.0.1, and torch.version.cuda==11.7
A single compute node with 8x NVIDIA GeForce GTX TITAN X GPUs.