Closed nevakrien closed 8 months ago
@delock, can you please help?
@mrwyattii, FYI
Hi @nevakrien, we need more information to troubleshoot this. Can you run the following commands to check your environment? Thanks!
pip list|grep torch
pip list|grep intel
pip list|grep ccl
python -c "import torch;import intel_extension_for_pytorch;print(torch.xpu.is_available());print(torch.xpu.device_count())"
BTW, this is the known-good configuration, which uses PyTorch 2.1.0: https://github.com/intel/intel-extension-for-pytorch/blob/release/xpu/2.1.10/dependency_version.yml Which instructions did you follow to install the environment, and what error message did you encounter? The exact error message would help with troubleshooting.
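The checks requested above can also be gathered in one pass. The following is a minimal sketch, not an official diagnostic: the helper names `matching_packages` and `xpu_status` are hypothetical, and it assumes that importing `intel_extension_for_pytorch` is what makes `torch.xpu` usable, as in the one-liner above.

```python
"""Collect the environment details requested above in one pass."""
from importlib import metadata


def matching_packages(keywords=("torch", "intel", "ccl")):
    """Return {name: version} for installed distributions whose name contains a keyword."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"] or ""
        if any(key in name.lower() for key in keywords):
            found[name] = dist.version
    return found


def xpu_status():
    """Report XPU availability, or the import error if the stack is incomplete."""
    try:
        import torch
        import intel_extension_for_pytorch  # noqa: F401  (assumed to enable torch.xpu)
    except Exception as exc:  # ImportError, or a loader error from a broken install
        return {"error": f"{type(exc).__name__}: {exc}"}
    return {
        "available": torch.xpu.is_available(),
        "device_count": torch.xpu.device_count(),
    }


if __name__ == "__main__":
    # Same information as the three `pip list | grep` commands.
    for name, version in sorted(matching_packages().items()):
        print(f"{name}=={version}")
    # Same information as the `python -c` one-liner.
    print(xpu_status())
```

Returning the import error instead of raising it keeps the report useful on a machine where the install is only partly working, which is the situation being debugged here.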
Hi @nevakrien - following up on this, are you able to run the commands above?
I no longer have access to that machine. I am on a new machine now, and I have changed my approach to the problem I am trying to solve, for reasons unrelated to DeepSpeed.
Sorry for the late reply; I only just saw the email.
Thanks @nevakrien
I am trying to get DeepSpeed working with distributed PyTorch on XPUs, and it has been fairly challenging. The setup script failed on my machine, so I downgraded the version until it worked. If this is not an issue on the new version, I am sorry, but I am not sure how to make the new version work: this is the newest version of everything that the Intel cloud setup allows, which I believe is because the GPUs are not yet compatible with newer kernel versions.
Describe the bug DeepSpeed does not detect my XPUs.
To Reproduce Steps to reproduce the behavior:
causes this:
[2024-02-27 13:04:26,914] [WARNING] [runner.py:132:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Traceback (most recent call last):
  File "/home/sdp/.conda/envs/deepspeed_old/bin/deepspeed", line 6, in <module>
    main()
  File "/home/sdp/.conda/envs/deepspeed_old/lib/python3.9/site-packages/deepspeed/launcher/runner.py", line 308, in main
    raise RuntimeError("Unable to proceed, no GPU resources available")
RuntimeError: Unable to proceed, no GPU resources available
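One way to narrow down an error like this is to ask DeepSpeed directly which accelerator it selected and how many devices it sees. The sketch below assumes an installed DeepSpeed version that ships the `deepspeed.accelerator` module; the downgraded version in this report may predate it or may not route the launcher through it. The helper name `deepspeed_device_report` is hypothetical.

```python
import os


def deepspeed_device_report():
    """Return what DeepSpeed's accelerator abstraction detects, or the failure reason."""
    try:
        # Assumption: this module exists in the installed DeepSpeed version.
        from deepspeed.accelerator import get_accelerator

        accel = get_accelerator()
        return {
            # Optional override; unset means DeepSpeed auto-detects the accelerator.
            "DS_ACCELERATOR": os.environ.get("DS_ACCELERATOR"),
            "device_name": accel.device_name(),
            "available": accel.is_available(),
            "device_count": accel.device_count(),
        }
    except Exception as exc:  # missing module, broken install, or invalid override
        return {"error": f"{type(exc).__name__}: {exc}"}


if __name__ == "__main__":
    print(deepspeed_device_report())
```

If this reports a device name other than `xpu`, or a device count of 0, while `torch.xpu.device_count()` reports the four GPUs listed below, the problem is likely in accelerator detection rather than in the driver stack.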
Expected behavior I expect it to run the user code.
ds_report output
Screenshots If applicable, add screenshots to help explain your problem.
System info (please complete the following information):
OS: Ubuntu 22.04.3 LTS
GPUs:
[0]: _DeviceProperties(name='Intel(R) Data Center GPU Max 1100', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=49152MB, max_compute_units=448)
[1]: _DeviceProperties(name='Intel(R) Data Center GPU Max 1100', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=49152MB, max_compute_units=448)
[2]: _DeviceProperties(name='Intel(R) Data Center GPU Max 1100', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=49152MB, max_compute_units=448)
[3]: _DeviceProperties(name='Intel(R) Data Center GPU Max 1100', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=49152MB, max_compute_units=448)
Python version 3.9
Any other relevant info about your setup intel-extension-for-pytorch is installed
Launcher context Launching with deepspeed.
Docker context No Docker; using native libraries for MKL etc. via oneAPI.
Additional context The DeepSpeed setup kept crashing over torch.ccl, specifically because it wanted a higher version of MKL than I had on my machine, which is why I was forced to downgrade.