Open yash3056 opened 2 days ago
cc @muellerzr @SunMarc
Accelerate should handle multi-XPU distributed training just like CUDA. What is the issue that you are facing? Are you receiving a CUDA-specific error when trying to train on XPU?
@SunMarc it is not picking up all XPU devices by default the way it does with CUDA.
Number of XPUs available:
import torch

# Print number of available XPUs
print(f"Number of available XPUs: {torch.xpu.device_count()}")
(aza) sdp@bvcoe:~$ /home/sdp/.conda/envs/aza/bin/python /home/sdp/main.py
[WARNING] Failed to create Level Zero tracer: 2013265921
Number of available XPUs: 16
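As a cross-check, here is a minimal diagnostic sketch (assuming a recent accelerate release with XPU support) that prints what Accelerate itself detects next to torch.xpu.device_count(); run directly with python it will report a single process and a single device.

import torch
from accelerate import Accelerator

# Compare the raw XPU count with what Accelerate actually picks up.
accelerator = Accelerator()
print(f"torch.xpu.device_count(): {torch.xpu.device_count()}")
print(f"Accelerator device:       {accelerator.device}")
print(f"Distributed type:         {accelerator.distributed_type}")
print(f"Number of processes:      {accelerator.num_processes}")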
Devices selected automatically during training (only device 0 shows any utilization):
(aza) sdp@bvcoe:~$ sudo xpu-smi dump -m 0,18
Timestamp, DeviceId, GPU Utilization (%), GPU Memory Used (MiB)
15:15:11.000, 0, 49.15, N/A
15:15:11.000, 1, 0.00, N/A
15:15:11.000, 2, 0.00, N/A
15:15:11.000, 3, 0.00, N/A
15:15:11.000, 4, 0.00, N/A
15:15:11.000, 5, 0.00, N/A
15:15:11.000, 6, 0.00, N/A
15:15:11.000, 7, 0.00, N/A
15:15:12.000, 0, 49.18, 8194.71
15:15:12.000, 1, 0.00, 121.34
15:15:12.000, 2, 0.00, 121.33
15:15:12.000, 3, 0.00, 121.34
15:15:12.000, 4, 0.00, 121.34
15:15:12.000, 5, 0.00, 121.28
15:15:12.000, 6, 0.00, 121.22
15:15:12.000, 7, 0.00, 121.22
15:15:13.000, 0, 49.17, 8194.71
15:15:13.000, 1, 0.00, 121.34
15:15:13.000, 2, 0.00, 121.33
15:15:13.000, 3, 0.00, 121.34
15:15:13.000, 4, 0.00, 121.34
15:15:13.000, 5, 0.00, 121.28
15:15:13.000, 6, 0.00, 121.22
15:15:13.000, 7, 0.00, 121.22
15:15:14.000, 0, 49.20, 8194.71
15:15:14.000, 1, 0.00, 121.34
15:15:14.000, 2, 0.00, 121.33
15:15:14.000, 3, 0.00, 121.34
15:15:14.000, 4, 0.00, 121.34
15:15:14.000, 5, 0.00, 121.28
15:15:14.000, 6, 0.00, 121.22
15:15:14.000, 7, 0.00, 121.22
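As a workaround sketch (assuming a recent accelerate with XPU support; the process count of 8 is an assumption matching the 8 cards listed above), the script can be launched explicitly as a multi-process job instead of relying on automatic detection, either after setting things up interactively with accelerate config or directly:

accelerate launch --num_processes 8 main.py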
@SunMarc, I'll take this issue, thanks. Hi @yash3056, could you share your script and the steps to reproduce? I can dig into this a bit.
Feature request
DDP support for XPU, like CUDA: the Trainer automatically uses multiple CUDA devices with the help of Accelerate, and it should likewise detect and use multiple XPU devices by default.
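A minimal sketch of the kind of script this refers to (the model name and the tiny synthetic dataset are illustrative placeholders, not the original reproduction script): launched with plain python it picks up every CUDA device automatically, and the request is for the same behaviour on XPU.

import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

class ToyDataset(Dataset):
    """Tiny synthetic classification dataset, just enough to exercise the Trainer."""
    def __init__(self, tokenizer, size=256):
        enc = tokenizer(["hello world"] * size, truncation=True,
                        padding="max_length", max_length=32, return_tensors="pt")
        self.input_ids = enc["input_ids"]
        self.attention_mask = enc["attention_mask"]
        self.labels = torch.randint(0, 2, (size,))

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return {"input_ids": self.input_ids[i],
                "attention_mask": self.attention_mask[i],
                "labels": self.labels[i]}

model_name = "bert-base-uncased"  # illustrative placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# No device handling here: the Trainer and Accelerate are expected to detect
# and use all available accelerators on their own.
args = TrainingArguments(output_dir="out", per_device_train_batch_size=8,
                         num_train_epochs=1, report_to="none")

trainer = Trainer(model=model, args=args, train_dataset=ToyDataset(tokenizer))
trainer.train()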
Motivation
Writing DDP code with the Trainer is fast and effective; writing a training loop by hand in PyTorch takes time.
Your contribution
None