huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

DDP for XPU in trainer #34881

Open yash3056 opened 2 days ago

yash3056 commented 2 days ago

Feature request

Add DDP support for XPU, as already exists for CUDA: Trainer automatically uses multiple CUDA devices with the help of Accelerate. Trainer should likewise detect and use multiple XPU devices by default.

Motivation

Writing DDP code with Trainer is fast and effective, whereas writing a training loop by hand in PyTorch takes time.

Your contribution

None

Rocketknight1 commented 2 days ago

cc @muellerzr @SunMarc

SunMarc commented 2 days ago

Accelerate should handle multi-XPU distributed training just like CUDA. What issue are you facing? Are you getting a CUDA-specific error when you try to train on XPU?

yash3056 commented 12 hours ago

@SunMarc By default it does not pick up all XPU devices the way it does with CUDA.

Number of XPUs available:

import torch

# Print number of available XPUs
print(f"Number of available XPUs: {torch.xpu.device_count()}")

(aza) sdp@bvcoe:~$ /home/sdp/.conda/envs/aza/bin/python /home/sdp/main.py
[WARNING] Failed to create Level Zero tracer: 2013265921
Number of available XPUs: 16

Number of devices selected automatically:

(aza) sdp@bvcoe:~$ sudo xpu-smi dump -m 0,18
Timestamp, DeviceId, GPU Utilization (%), GPU Memory Used (MiB)
15:15:11.000,    0, 49.15,  N/A
15:15:11.000,    1, 0.00,  N/A
15:15:11.000,    2, 0.00,  N/A
15:15:11.000,    3, 0.00,  N/A
15:15:11.000,    4, 0.00,  N/A
15:15:11.000,    5, 0.00,  N/A
15:15:11.000,    6, 0.00,  N/A
15:15:11.000,    7, 0.00,  N/A
15:15:12.000,    0, 49.18, 8194.71
15:15:12.000,    1, 0.00, 121.34
15:15:12.000,    2, 0.00, 121.33
15:15:12.000,    3, 0.00, 121.34
15:15:12.000,    4, 0.00, 121.34
15:15:12.000,    5, 0.00, 121.28
15:15:12.000,    6, 0.00, 121.22
15:15:12.000,    7, 0.00, 121.22
15:15:13.000,    0, 49.17, 8194.71
15:15:13.000,    1, 0.00, 121.34
15:15:13.000,    2, 0.00, 121.33
15:15:13.000,    3, 0.00, 121.34
15:15:13.000,    4, 0.00, 121.34
15:15:13.000,    5, 0.00, 121.28
15:15:13.000,    6, 0.00, 121.22
15:15:13.000,    7, 0.00, 121.22
15:15:14.000,    0, 49.20, 8194.71
15:15:14.000,    1, 0.00, 121.34
15:15:14.000,    2, 0.00, 121.33
15:15:14.000,    3, 0.00, 121.34
15:15:14.000,    4, 0.00, 121.34
15:15:14.000,    5, 0.00, 121.28
15:15:14.000,    6, 0.00, 121.22
15:15:14.000,    7, 0.00, 121.22

yao-matrix commented 1 hour ago

@SunMarc, I'll take this issue, thanks. Hi @yash3056, could you share your script and the steps to reproduce? I can dig into this.
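
When putting the reproduction together, one thing that may be worth checking is how the script was launched. The output above shows a plain `python /home/sdp/main.py` run, which starts a single process. The sketch below is not taken from the reporter's script. It only reads the standard `WORLD_SIZE`, `RANK` and `LOCAL_RANK` environment variables that distributed launchers such as `torchrun` set for each worker. The helper name `describe_launch` is hypothetical.

```python
import os


def describe_launch(env=None):
    """Summarize the distributed-launch variables visible to this process.

    `env` defaults to os.environ; pass a dict to inspect a hypothetical setup.
    """
    env = os.environ if env is None else env
    # Launchers such as torchrun export WORLD_SIZE, RANK and LOCAL_RANK for
    # every worker; with a plain `python script.py` they are normally unset.
    world_size = int(env.get("WORLD_SIZE", "1"))
    return {
        "world_size": world_size,
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "-1")),
        "distributed": world_size > 1,
    }


if __name__ == "__main__":
    print(describe_launch())
```

If this reports `distributed: False`, only one process is running, which would be consistent with only device 0 showing utilization in the `xpu-smi` dump. Starting the same script through a launcher (for example `torchrun --nproc_per_node=8 main.py`, with the process count here only an assumption) should change the reported values. Whether that resolves this issue on XPU is not confirmed in this thread.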