Closed: ehartford closed this issue 6 months ago
Hi @ehartford
Do you get the same error if you run `DS_BUILD_CPU_ADAM=1 TORCH_HIP_ARCH_LIST="gfx90a" pip install deepspeed`?
Also FYI @jithunnair-amd and @rraminen
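For readers following along, a minimal sketch of what the two switches in that command do (the descriptions reflect DeepSpeed's prebuild convention and the gfx90a/MI210 hardware discussed later in this thread; treat them as assumptions, not official documentation):

```shell
# Prebuild switches from this thread (sketch; do not run blindly).
export DS_BUILD_CPU_ADAM=1          # precompile the cpu_adam op at install time
                                    # instead of JIT-compiling it on first use
export TORCH_HIP_ARCH_LIST="gfx90a" # restrict HIP codegen to the gfx90a arch
                                    # (MI210/MI250-class GPUs)
# With both set, the install command from the thread is:
#   pip install deepspeed
echo "DS_BUILD_CPU_ADAM=$DS_BUILD_CPU_ADAM TORCH_HIP_ARCH_LIST=$TORCH_HIP_ARCH_LIST"
```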
> Hi @ehartford
> Do you get the same error if you run `DS_BUILD_CPU_ADAM=1 TORCH_HIP_ARCH_LIST="gfx90a" pip install deepspeed`?
Yes
Thanks, I'll run some tests on our nodes and see what I can come up with
Hi @ehartford - I'm not able to repro this on my side, though I'm only using ROCm 5.5 so far (I'll test with a newer version), and since I have slightly different hardware I didn't specify TORCH_HIP_ARCH_LIST. However, running
DS_BUILD_CPU_ADAM=1 python setup.py install
with the current head of master yields the following for me:
[2024-03-27 20:28:30,994] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn is not compatible with ROCM
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn is not compatible with ROCM
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/envs/ptca/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1+rocm5.5
deepspeed install path ........... ['/opt/conda/envs/ptca/lib/python3.8/site-packages/deepspeed-0.14.1+cea5ea1e-py3.8-linux-x86_64.egg/deepspeed']
deepspeed info ................... 0.14.1+cea5ea1e, cea5ea1e, master
torch cuda version ............... None
torch hip version ................ 5.5.30201-c1741e9b
nvcc version ..................... None
deepspeed wheel compiled w. ...... torch 1.13, hip 5.5
shared memory (/dev/shm) size .... 199.00 GB
@ehartford - we are seeing a pip install of deepspeed pass in our CI; could you check what is different in your setup that you may be hitting?
I will try to reproduce
Hi @ehartford - curious if there were any updates on this?
Hi @ehartford - closing this for now since I can't reproduce it. If you're still hitting this, please comment or open a new issue and link this one. Thanks!
Ubuntu Server 20.04
AMD MI210 (gfx90a)
ROCm 6.0
torch-2.3.0.dev20240309+rocm6.0
DeepSpeed tag v0.14.0
DeepSpeed ZeRO-1 was working, but ZeRO-2 wasn't.
So I deleted DeepSpeed and installed it manually from source.
I set environment variables like this:
Then when I try to run
DS_BUILD_CPU_ADAM=1 TORCH_HIP_ARCH_LIST="gfx90a" python setup.py install
I get the same error. I implemented a fix that unblocked me, but it was rejected: https://github.com/microsoft/DeepSpeed/pull/5249