huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

fix bug when getting the real accelerator's device number #2874

Closed: faaany closed this 3 months ago

faaany commented 3 months ago

What does this PR do?

This PR is a follow-up fix for PR #2826. I want to correct my statement in that PR that `torch.device(d).type == "xpu"` is enough to detect an XPU device, just like in the NPU and MLU cases. That was my mistake. In fact, `torch.device(0).type` always returns `"cuda"` on XPU, as can be seen from the PyTorch code and from the PyTorch official doc, at least for now. We are working on a PR to support it in a future PyTorch version. Also, for the NPU path, I think `torch.device(0).type` will return `"cuda"`, as can be seen here.
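
Below is a minimal sketch of the kind of check this behavior forces: for an integer device id, `torch.device(id).type` cannot be trusted, so the real backend has to be probed. The helper name `resolve_device_type` and the probing order are illustrative assumptions, not the actual patch in this PR or Accelerate's API.

```python
import torch

def resolve_device_type(device) -> str:
    """Best-effort resolution of the real accelerator type for a device spec.

    Strings and torch.device objects already carry the correct type, but an
    integer id always maps to type "cuda" via torch.device(int), even on XPU
    (at least before the upstream PyTorch change discussed in this thread).
    """
    if not isinstance(device, int):
        return torch.device(device).type
    # Integer id: probe which accelerator backend is actually present.
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"
    return "cuda"
```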

In addition, users might pass a device id that exceeds the available device count. In that case, we should not count the invalid id toward `num_devices` when calculating the balanced memory. So this PR actually fixes 2 issues:

- the device type check, which cannot rely on `torch.device(d).type` for integer ids on XPU;
- device ids exceeding the available device count being counted in `num_devices` for balanced memory.
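
The second issue can be illustrated with a small hedged sketch of how out-of-range ids could be filtered out; the helper `count_valid_devices` and its `max_memory` argument (a mapping keyed by device id, as in the balanced-memory utilities) are assumptions for illustration, not the code actually changed in this PR.

```python
import torch

def count_valid_devices(max_memory: dict) -> int:
    """Count only the integer device ids that actually exist on this machine.

    Ids in ``max_memory`` that exceed the available device count should not
    contribute to ``num_devices`` when balancing memory across devices.
    """
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        available = torch.xpu.device_count()
    elif torch.cuda.is_available():
        available = torch.cuda.device_count()
    else:
        available = 0
    return len([d for d in max_memory if isinstance(d, int) and d < available])
```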

Who can review?

@SunMarc and @muellerzr

yao-matrix commented 3 months ago

OK for me.

HuggingFaceDocBuilderDev commented 3 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

faaany commented 3 months ago

@SunMarc @muellerzr

faaany commented 3 months ago

The good news is that, in the meantime, our proposal to make torch.device(0) return 'xpu' on XPU was approved in upstream PyTorch (PR link). To avoid "reverse-engineering", let me close this PR. Thanks so much for the discussion! @SunMarc

muellerzr commented 3 months ago

That's great @faaany !

SunMarc commented 3 months ago

Nice :)