microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] High memory usage on first GPU, despite perfectly-balanced stages in pipeline #4477

Open brunomaga opened 1 year ago

brunomaga commented 1 year ago

Describe the bug

When using pipelining (with or without LayerSpec inside PipelineModule), the first GPU seems to have considerably higher memory consumption than the other ones. This is visible even with perfectly balanced ML models (like the model I attach below).

To Reproduce
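A minimal sketch of the kind of setup described above (not the original attached model; the layer count, hidden size, stage count, and ds_config.json contents are illustrative assumptions): a stack of identical blocks wrapped in PipelineModule via LayerSpec, so every stage holds the same number of parameters, trained with synthetic data.

```python
import torch
import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec

HIDDEN = 1024     # assumed hidden size
N_LAYERS = 32     # assumed layer count, divisible by the number of stages

class Block(torch.nn.Module):
    """One identical block, so all pipeline stages are perfectly balanced."""
    def __init__(self, hidden):
        super().__init__()
        self.linear = torch.nn.Linear(hidden, hidden)

    def forward(self, x):
        return torch.relu(self.linear(x))

def loss_fn(outputs, labels):
    return torch.nn.functional.mse_loss(outputs, labels)

if __name__ == "__main__":
    deepspeed.init_distributed()

    # Identical LayerSpecs -> each of the 4 stages gets the same parameter count.
    specs = [LayerSpec(Block, HIDDEN) for _ in range(N_LAYERS)]
    model = PipelineModule(layers=specs, num_stages=4, loss_fn=loss_fn)

    engine, _, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config="ds_config.json",  # hypothetical config with batch sizes, optimizer, etc.
    )

    # Synthetic (input, label) pairs; only the first/last stages consume the iterator.
    def batches():
        while True:
            x = torch.randn(engine.train_micro_batch_size_per_gpu(), HIDDEN)
            yield x, x.clone()

    it = iter(batches())
    for _ in range(20):
        engine.train_batch(data_iter=it)
```

Launch with e.g. `deepspeed --num_gpus=4 repro.py` and watch nvidia-smi while it trains.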

Expected behaviour

Running nvidia-smi shows the per-GPU memory usage; a big difference between GPU 0 and the others should be visible:

[nvidia-smi screenshot: GPU 0 shows much higher memory usage than the other GPUs]
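To separate the caching allocator's actual allocations from what nvidia-smi reports (which also includes the CUDA context and reserved-but-free blocks), a small hypothetical helper like the one below can print each rank's allocated/reserved/peak memory from inside the script, e.g. right after a few train_batch calls. report_gpu_memory is not part of DeepSpeed, just an illustration.

```python
import torch
import torch.distributed as dist

def report_gpu_memory(tag: str = "") -> None:
    """Print this rank's CUDA memory counters (GiB). Assumes the process group
    is already initialized, e.g. via deepspeed.init_distributed()."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"[rank {rank}] {tag} allocated={allocated:.2f} GiB "
          f"reserved={reserved:.2f} GiB peak={peak:.2f} GiB", flush=True)

# e.g. call report_gpu_memory("after 20 steps") right after the train_batch loop.
```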

ds_report output

[2023-10-09 10:53:47,040] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  NVIDIA Inference is only supported on Pascal and newer architectures
transformer_inference .. [NO] ....... [NO]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['~/.local/lib/python3.10/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['~/.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.3, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 12.2
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
shared memory (/dev/shm) size .... 125.89 GB

System info:

Armanasq commented 5 months ago

I have the same issue here. Is there any solution for it?

Jiachen2cc commented 3 months ago

Same issue here, any solutions now? :)