microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] hard coded to CUDA in builder.py #5274

Closed ehartford closed 6 months ago

ehartford commented 8 months ago

- OS: Ubuntu Server 20.04
- GPU: AMD MI210 (gfx90a)
- ROCm: 6.0
- PyTorch: torch-2.3.0.dev20240309+rocm6.0
- DeepSpeed: tag v0.14.0

$ python -c "import torch; print(torch.version.hip)"
6.0.32830-d62f6a171

DeepSpeed ZeRO-1 was working, but ZeRO-2 was not:

[rank14]:   File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/accelerate/accelerator.py", line 1598, in _prepare_deepspeed
[rank14]:     optimizer = DeepSpeedCPUAdam(optimizer.param_groups, **defaults)
[rank14]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/deepspeed-0.14.1+535a908f-py3.12.egg/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
[rank14]:     self.ds_opt_adam = CPUAdamBuilder().load()
[rank14]:                        ^^^^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/deepspeed-0.14.1+535a908f-py3.12.egg/deepspeed/ops/op_builder/builder.py", line 479, in load
[rank14]:     return self.jit_load(verbose)
[rank14]:            ^^^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/deepspeed-0.14.1+535a908f-py3.12.egg/deepspeed/ops/op_builder/builder.py", line 511, in jit_load
[rank14]:     cxx_args = self.strip_empty_entries(self.cxx_args())
[rank14]:                                         ^^^^^^^^^^^^^^^
[rank14]:   File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/deepspeed-0.14.1+535a908f-py3.12.egg/deepspeed/ops/op_builder/builder.py", line 766, in cxx_args
[rank14]:     CUDA_ENABLE = self.is_cuda_enable()
[rank14]:                   ^^^^^^^^^^^^^^^^^^^^^
[rank14]:   File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/deepspeed-0.14.1+535a908f-py3.12.egg/deepspeed/ops/op_builder/builder.py", line 370, in is_cuda_enable
[rank14]:     assert_no_cuda_mismatch(self.name)
[rank14]:   File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/deepspeed-0.14.1+535a908f-py3.12.egg/deepspeed/ops/op_builder/builder.py", line 85, in assert_no_cuda_mismatch
[rank14]:     torch_cuda_version = ".".join(torch.version.cuda.split('.')[:2])
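
The immediate cause is that on a ROCm build of PyTorch, `torch.version.cuda` is `None` and the HIP version lives in `torch.version.hip`, so the hard-coded CUDA version check in builder.py falls over. A minimal illustration using only stock torch attributes:

```python
# On a ROCm wheel, torch.version.cuda is None and torch.version.hip is set,
# which is why ".".join(torch.version.cuda.split('.')[:2]) raises
# AttributeError inside assert_no_cuda_mismatch.
import torch

if torch.version.hip is not None:
    print("ROCm/HIP build:", torch.version.hip)
elif torch.version.cuda is not None:
    print("CUDA build:", torch.version.cuda)
else:
    print("CPU-only build")
```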

So I deleted DeepSpeed and installed it manually from source.

I set environment variables like this:

export GPU_ARCHS="gfx90a"
export ROCM_TARGET="gfx90a"
export HIP_PATH="/opt/rocm-6.0.0"
export ROCM_PATH="/opt/rocm-6.0.0"
export ROCM_HOME="/opt/rocm-6.0.0"
export HIP_PLATFORM=amd
export DS_BUILD_CPU_ADAM=1 
export TORCH_HIP_ARCH_LIST="gfx90a"

Then when I run `DS_BUILD_CPU_ADAM=1 TORCH_HIP_ARCH_LIST="gfx90a" python setup.py install` I get:

  File "/scratch/axolotl/DeepSpeed/op_builder/builder.py", line 85, in assert_no_cuda_mismatch
    torch_cuda_version = ".".join(torch.version.cuda.split('.')[:2])
                                  ^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'split'

I implemented a fix that unblocked me, but it was rejected. https://github.com/microsoft/DeepSpeed/pull/5249
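
For reference, a guard along these lines is enough to get past the crash (a sketch of the general idea only, not the actual diff in that PR):

```python
# Sketch only (not the actual change from PR #5249): skip the CUDA version
# comparison when PyTorch was built against ROCm, where torch.version.cuda
# is None and torch.version.hip is set instead.
import torch

def assert_no_cuda_mismatch(name=""):
    if torch.version.hip is not None:
        # ROCm/HIP build: there is no CUDA toolkit version to compare against.
        return True
    torch_cuda_version = ".".join(torch.version.cuda.split('.')[:2])
    # ... the original nvcc-vs-torch CUDA version comparison continues here ...
    return True
```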

loadams commented 7 months ago

Hi @ehartford

Do you get the same error if you run `DS_BUILD_CPU_ADAM=1 TORCH_HIP_ARCH_LIST="gfx90a" pip install deepspeed`?

loadams commented 7 months ago

Also FYI @jithunnair-amd and @rraminen

ehartford commented 7 months ago

> Hi @ehartford
>
> Do you get the same error if you run `DS_BUILD_CPU_ADAM=1 TORCH_HIP_ARCH_LIST="gfx90a" pip install deepspeed`?

Yes

loadams commented 7 months ago

Thanks, I'll run some tests on our nodes and see what I can come up with

loadams commented 7 months ago

Hi @ehartford - I'm not able to repro this on my side, though I'm only using ROCm 5.5 so far; I'll test with a newer version. Since I have slightly different hardware, I didn't specify `TORCH_HIP_ARCH_LIST`. Running `DS_BUILD_CPU_ADAM=1 python setup.py install` with the current head of master yields the following for me:

[2024-03-27 20:28:30,994] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn is not compatible with ROCM
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn is not compatible with ROCM
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/envs/ptca/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1+rocm5.5
deepspeed install path ........... ['/opt/conda/envs/ptca/lib/python3.8/site-packages/deepspeed-0.14.1+cea5ea1e-py3.8-linux-x86_64.egg/deepspeed']
deepspeed info ................... 0.14.1+cea5ea1e, cea5ea1e, master
torch cuda version ............... None
torch hip version ................ 5.5.30201-c1741e9b
nvcc version ..................... None
deepspeed wheel compiled w. ...... torch 1.13, hip 5.5
shared memory (/dev/shm) size .... 199.00 GB
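
For completeness, the accelerator DeepSpeed auto-detects (the "Setting ds_accelerator to cuda" line at the top of the output) can be queried directly; a minimal sketch using the public `get_accelerator()` helper, with output varying by install:

```python
# Minimal sketch: check which accelerator DeepSpeed auto-detects. ROCm
# installs still report "cuda" here, since HIP is surfaced through the
# torch.cuda namespace.
from deepspeed.accelerator import get_accelerator

acc = get_accelerator()
print(acc.device_name())                 # e.g. "cuda"
print(acc.communication_backend_name())  # e.g. "nccl" (RCCL on ROCm)
```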

loadams commented 7 months ago

@ehartford - we are seeing a pip install of deepspeed pass in our CI; could you check what might be different in the setup where you're hitting this?

ehartford commented 7 months ago

I will try to reproduce

loadams commented 7 months ago

Hi @ehartford - curious if there were any updates on this?

loadams commented 6 months ago

Hi @ehartford - closing this for now since we can't reproduce it. If you're still hitting this, please comment or open a new issue and link this one. Thanks!