huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

xpu: support xpu backend from stock pytorch (>=2.4) #2825

Closed (dvrogozh closed this 3 weeks ago)

dvrogozh commented 1 month ago

Fixes: https://github.com/huggingface/transformers/issues/31237

The XPU backend is available in stock PyTorch starting from version 2.4 [1]. This commit extends Hugging Face Accelerate to support XPU from both IPEX and stock PyTorch; IPEX is tried first.
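
For illustration, a minimal sketch of that detection order, assuming stock PyTorch's native `torch.xpu` module (available from 2.4 on); this is not the PR's actual code, and `is_xpu_available` here is only a stand-in for the helper being extended:

```python
# Sketch: prefer IPEX, fall back to the torch.xpu backend in stock PyTorch >= 2.4.
import importlib.util

import torch
from packaging import version


def is_xpu_available() -> bool:
    # Importing intel_extension_for_pytorch registers its XPU backend with torch.
    if importlib.util.find_spec("intel_extension_for_pytorch") is not None:
        import intel_extension_for_pytorch  # noqa: F401
    elif version.parse(torch.__version__).release < (2, 4):
        # Stock PyTorch only ships a usable torch.xpu module from 2.4 on.
        return False
    return hasattr(torch, "xpu") and torch.xpu.is_available()
```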

I am raising this PR as a WIP draft to facilitate further discussion around enabling the XPU backend in Hugging Face, and to communicate observed XPU issues back to PyTorch.

[1] https://github.com/pytorch/pytorch/issues/114842

@EikanWang, @fengyuan14, @guangyey, @jgong5, @kding1, @sywangyi

HuggingFaceDocBuilderDev commented 4 weeks ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

dvrogozh commented 3 weeks ago

I tried this PR (together with https://github.com/huggingface/transformers/pull/31238) as thoroughly as I could in the IPEX-CPU, IPEX-XPU, PyTorch-XPU, and PyTorch-CPU scenarios. I ran a number of tests from accelerate and transformers and some examples from transformers; all of them engage XPU when expected. I am promoting these PRs from draft for qualified review. Let me know if there are any concerns or feedback that needs to be addressed.
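
As a concrete (hypothetical) smoke test for "engaging XPU when expected": with this PR, `Accelerator` on a stock PyTorch 2.4 build with an Intel GPU should select the xpu device automatically, even with no IPEX installed:

```python
# Hypothetical smoke test: device selection should land on xpu when available.
import torch
from accelerate import Accelerator

accelerator = Accelerator()
print(accelerator.device)  # expected: xpu (cpu if no XPU device is present)

model = torch.nn.Linear(2, 4)
model = accelerator.prepare(model)  # places the model on accelerator.device
```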

dvrogozh commented 3 weeks ago

Applied `doc-builder style src/accelerate docs/source --max_len 119` to fix the formatting issues identified by CI.

dvrogozh commented 3 weeks ago

@muellerzr: could you please rerun CI? Also, is there anything else I can fix in this PR to help get it merged?

dvrogozh commented 3 weeks ago

I have not seen this failure on this PR before. Could it be something random? I cannot associate it with the changes made, and the test also passed for me locally on CPU. @muellerzr, can you please advise?

FAILED tests/test_accelerator.py::AcceleratorTester::test_save_load_model_with_hooks_use_pytorch - assert 0.0007739067077636719 > 0.001
 +  where 0.0007739067077636719 = abs((4.019573211669922 - 4.0203471183776855))
 +    where 4.0203471183776855 = get_signature(Linear(in_features=2, out_features=4, bias=True))
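
For context on why this can be sporadic: judging from the pasted assert, the test reduces a model's parameters to a single float (`get_signature`) and expects a freshly initialized model to differ from the saved one by more than 1e-3; occasionally random initialization lands within that tolerance. A minimal sketch of that shape (this `get_signature` body is a hypothetical stand-in, not Accelerate's actual test helper):

```python
# Sketch: a scalar parameter "signature"; two randomly initialized models are
# expected to differ by more than a small tolerance, which can rarely fail.
import torch


def get_signature(model):
    # Hypothetical stand-in: collapse all parameters into one scalar.
    return sum(p.abs().sum().item() for p in model.parameters())


a = torch.nn.Linear(in_features=2, out_features=4, bias=True)
b = torch.nn.Linear(in_features=2, out_features=4, bias=True)
# Usually True, but random init can occasionally land within the tolerance,
# which would make a test asserting this flaky.
print(abs(get_signature(a) - get_signature(b)) > 1e-3)
```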

dvrogozh commented 3 weeks ago

@SunMarc: thank you for retriggering the failed CI. I see it is passing now, so my assumption that this was a sporadic failure seems correct.

@SunMarc, @muellerzr: I have outlined the current status of the XPU backend in PyTorch in https://github.com/huggingface/transformers/issues/31237. There are a number of issues in the XPU backend that are being worked on right now. I believe, however, that this PR and the transformers PR (https://github.com/huggingface/transformers/issues/31237) are ready as a first step to enable the XPU backend in Hugging Face, on top of which we can gradually improve the support. Could you please outline the acceptance requirements for these PRs on the Hugging Face side?