Open fxmarty opened 9 months ago
FYI @rraminen and @jithunnair-amd
Hello, I have the same problem when using torch==2.2.0+rocm5.6 & deepspeed==0.12.4. How can this be solved? Any help would be greatly appreciated.
@xxtars - can you try with a newer version of DeepSpeed and let me know if you still repro this? Specifically 0.14.0?
Thank you for your reply! I'm not familiar with DeepSpeed and ZeRO. I am training with ZeRO-2 offload (zero2_offload) and hit this issue during JIT compilation. The versions I have tried are:
Did I miss setting some environment variable or something else? I am also new to ROCm.
Here is my ds_report output. Any help would be greatly appreciated.
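For context, a ZeRO-2 optimizer-offload setup along these lines is typically configured like this (field names follow the DeepSpeed config schema; the values are illustrative, not taken from this report). It is this `offload_optimizer` path that makes DeepSpeed instantiate `DeepSpeedCPUAdam` and JIT-build the `cpu_adam` op at startup:

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```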
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn is not compatible with ROCM
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/scratch/-/miniconda3/envs/llamavid_rocm5.6/lib/python3.10/site-packages/torch']
torch version .................... 2.2.2+rocm5.6
deepspeed install path ........... ['/scratch/-/miniconda3/envs/llamavid_rocm5.6/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.14.0, unknown, unknown
torch cuda version ............... None
torch hip version ................ 5.6.31061-8c743ae5d
nvcc version ..................... None
deepspeed wheel compiled w. ...... torch 2.2, hip 5.6
shared memory (/dev/shm) size .... 500.00 GB
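As an aside, the `installed .. compatible` table above can be read programmatically if that helps when comparing reports across machines. This is a throwaway sketch based only on the row format shown above, not part of DeepSpeed:

```python
import re

def parse_op_line(line):
    """Parse one ds_report row like
    'cpu_adam ............... [NO] ....... [OKAY]'
    into (op_name, installed, compatible)."""
    m = re.match(r"(\w+)\s*\.+\s*\[(\w+)\]\s*\.+\s*\[(\w+)\]", line)
    if not m:
        return None
    name, installed, compatible = m.groups()
    return name, installed == "OKAY", compatible == "OKAY"

print(parse_op_line("cpu_adam ............... [NO] ....... [OKAY]"))
# ('cpu_adam', False, True): not prebuilt, but JIT-compatible
```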
@xxtars - can you please share the command you are running that produces the error?
@xxtars Sorry for the lack of updates. We are actively looking into this issue and should have an update by next week.
@loadams Initially, I was using transformers with DeepSpeed ZeRO-2 offload to train llama-vid. When JIT compilation started, the same warning as in this issue appeared. After seeing this issue, I ran the same test and got the same warning.
```python
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam
fused_adam = DeepSpeedCPUAdam([torch.rand(10)])
```
yields `[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!`
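For what it's worth, the wording of the warning suggests the op builder is keying off the CUDA field of the torch build: the ds_report above shows `torch cuda version ... None` but `torch hip version ... 5.6.31061-8c743ae5d`, so a check that only looks at the CUDA field would conclude "cuda is missing" on a perfectly good ROCm wheel. A toy sketch of that distinction (plain Python illustrating the reasoning, not DeepSpeed's actual builder code):

```python
def describe_torch_build(cuda_version, hip_version):
    """Classify a torch build the way a naive op builder might,
    given torch.version.cuda and torch.version.hip."""
    if hip_version is not None:
        return "rocm"       # ROCm wheel: CUDA field is None, HIP field is set
    if cuda_version is not None:
        return "cuda"       # CUDA wheel
    return "cpu-only"       # neither backend present

# Values as reported by ds_report in this thread:
print(describe_torch_build(None, "5.6.31061-8c743ae5d"))  # rocm
print(describe_torch_build("12.1", None))                 # cuda
```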
Thanks for any help.
Hello, any update on this? I am also working on MI250 GPUs (4) and I am using HF transformers with DeepSpeed ZeRO stage 2 to train Mistral-7B.
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fp_quantizer ........... [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn is not compatible with ROCM
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch version .................... 2.3.1+rocm5.7
deepspeed info ................... 0.14.4, unknown, unknown
torch cuda version ............... None
torch hip version ................ 5.7.31921-d1770ee1b
nvcc version ..................... None
deepspeed wheel compiled w. ...... torch 2.3, hip 5.7
Thank you in advance :)
Describe the bug
Hi, running
```python
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam
fused_adam = DeepSpeedCPUAdam([torch.rand(10)])
```
yields
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
on a ROCm install of PyTorch & DeepSpeed. Is this expected? Thank you!
To Reproduce
Use torch==2.1.1+rocm5.6 & deepspeed==0.12.4.