huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

fix bug when getting the real accelerator's device number #2874

Closed: faaany closed this 3 months ago

faaany commented 3 months ago

What does this PR do?

This PR is a follow-up fix for PR #2826. I want to correct my statement in that PR that `torch.device(d).type == "xpu"` is enough to detect an XPU device, just like in the NPU and MLU cases. That was my mistake. In fact, `torch.device(0).type` always returns `"cuda"` on XPU, as can be seen from the PyTorch code and from the PyTorch official doc, at least for now. We are working on a PR to support it in a future PyTorch version. Also, for the NPU path, I think `torch.device(0).type` will return `"cuda"`, as can be seen here.
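
Below is a minimal sketch of the kind of check this behavior forces: for an integer device id, `torch.device(id).type` cannot be trusted, so the real backend has to be probed. The helper name `resolve_device_type` and the probing order are illustrative assumptions, not the actual patch in this PR or Accelerate's API.

```python
import torch

def resolve_device_type(device) -> str:
    """Best-effort resolution of the real accelerator type for a device spec.

    Strings and torch.device objects already carry the correct type, but an
    integer id always maps to type "cuda" via torch.device(int), even on XPU
    (at least before the upstream PyTorch change discussed in this thread).
    """
    if not isinstance(device, int):
        return torch.device(device).type
    # Integer id: probe which accelerator backend is actually present.
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"
    return "cuda"
```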

In addition, users might pass a device id that exceeds the available device count. In that case, we should not count the invalid id toward `num_devices` when calculating the balanced memory. So this PR actually fixes 2 issues:

- the device type check, which cannot rely on `torch.device(d).type` for integer ids on XPU;
- device ids exceeding the available device count being counted in `num_devices` for balanced memory.
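
The second issue can be illustrated with a small hedged sketch of how out-of-range ids could be filtered out; the helper `count_valid_devices` and its `max_memory` argument (a mapping keyed by device id, as in the balanced-memory utilities) are assumptions for illustration, not the code actually changed in this PR.

```python
import torch

def count_valid_devices(max_memory: dict) -> int:
    """Count only the integer device ids that actually exist on this machine.

    Ids in ``max_memory`` that exceed the available device count should not
    contribute to ``num_devices`` when balancing memory across devices.
    """
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        available = torch.xpu.device_count()
    elif torch.cuda.is_available():
        available = torch.cuda.device_count()
    else:
        available = 0
    return len([d for d in max_memory if isinstance(d, int) and d < available])
```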

Who can review?

@SunMarc and @muellerzr

yao-matrix commented 3 months ago

OK for me.

HuggingFaceDocBuilderDev commented 3 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

faaany commented 3 months ago

@SunMarc @muellerzr

faaany commented 3 months ago

The good news is that, in the meantime, our proposal to make torch.device(0) return 'xpu' on XPU was approved in upstream PyTorch (PR link). To avoid "reverse-engineering", let me close this PR. Thanks so much for the discussion! @SunMarc

muellerzr commented 3 months ago

That's great @faaany !

SunMarc commented 3 months ago

Nice :)