microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0
35.37k stars 4.1k forks

xpus not detected[BUG] #5202

Closed nevakrien closed 8 months ago

nevakrien commented 8 months ago

I am trying to get DeepSpeed working with distributed PyTorch on XPUs, and it has been fairly challenging. The setup script failed on my machine, so I downgraded the version until it worked; if this is not an issue on the new version, I am sorry. But I am not really sure how else to make this work, since this is the newest version of everything the Intel cloud setup would allow, which I believe is because the GPUs are not yet compatible with newer kernel versions.

Describe the bug DeepSpeed does not detect my XPUs.

To Reproduce Steps to reproduce the behavior:

  1. source /opt/intel/oneapi/setvars.sh
  2. deepspeed --num_gpus=4 train.py --config=configs/train/mlm.yaml --deepspeed_config_file=configs/deepspeed/ds_config.json --dtype=bf16

causes this:

[2024-02-27 13:04:26,914] [WARNING] [runner.py:132:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Traceback (most recent call last):
  File "/home/sdp/.conda/envs/deepspeed_old/bin/deepspeed", line 6, in <module>
    main()
  File "/home/sdp/.conda/envs/deepspeed_old/lib/python3.9/site-packages/deepspeed/launcher/runner.py", line 308, in main
    raise RuntimeError("Unable to proceed, no GPU resources available")
RuntimeError: Unable to proceed, no GPU resources available
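The error means the launcher found no visible accelerators. A minimal standalone check of whether PyTorch itself can see the XPUs (a sketch, not DeepSpeed's own detection code; `xpu_device_count` is a hypothetical helper name, and the imports are guarded so the script also runs where torch or IPEX is missing):

```python
# Sketch: report how many XPU devices PyTorch can see.
# torch / intel_extension_for_pytorch may be absent, so imports are guarded.
def xpu_device_count() -> int:
    try:
        import torch
        import intel_extension_for_pytorch  # noqa: F401 -- registers torch.xpu
    except ImportError:
        return 0
    xpu = getattr(torch, "xpu", None)
    if xpu is None or not xpu.is_available():
        return 0
    return xpu.device_count()

if __name__ == "__main__":
    print(f"visible XPU devices: {xpu_device_count()}")
```

If this prints 0, the problem is below DeepSpeed, in the torch/IPEX/driver stack.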

Expected behavior I expected it to run the user code.

ds_report output

(deepspeed_old) sdp@gpu-node:~/contrastors$ ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn cuda is not available from torch
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/sdp/.conda/envs/deepspeed_old/lib/python3.9/site-packages/torch']
torch version .................... 1.13.0a0+git6c9b55e
torch cuda version ............... None
nvcc version .....................  [FAIL] cannot find CUDA_HOME via torch.utils.cpp_extension.CUDA_HOME=None 
deepspeed install path ........... ['/home/sdp/.conda/envs/deepspeed_old/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.5.9, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.13, cuda 0.0
(deepspeed_old) sdp@gpu-node:~/contrastors$ 

Screenshots: image attached.

System info (not provided)

Launcher context Launching with the deepspeed launcher.

Docker context No Docker; using native libraries for MKL etc. via oneAPI.

Additional context The DeepSpeed setup kept crashing on torch.ccl, specifically because it wanted a higher version of MKL than I had on my machine, which is why I was forced to downgrade.

tjruwase commented 8 months ago

@delock, can you please help?

@mrwyattii, FYI

delock commented 8 months ago

Hi @nevakrien, for troubleshooting we need to collect more information, can you run the following commands to check your environment? Thanks!

pip list|grep torch
pip list|grep intel
pip list|grep ccl
python -c "import torch;import intel_extension_for_pytorch;print(torch.xpu.is_available());print(torch.xpu.device_count())"

BTW, this is the known-good configuration, in which PyTorch 2.1.0 is used. Which instructions did you follow to install the environment, and what error message did you encounter? The exact error message would help troubleshooting. https://github.com/intel/intel-extension-for-pytorch/blob/release/xpu/2.1.10/dependency_version.yml

loadams commented 8 months ago

Hi @nevakrien - following up on this, are you able to run the commands above?

nevakrien commented 8 months ago

I no longer have access to that machine. I am on a new machine now, and I have changed my approach to the problem I am trying to solve, for reasons unrelated to DeepSpeed.

Sorry for the late reply; I only just saw the email.

loadams commented 8 months ago

Thanks @nevakrien