ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.22k stars 5.81k forks

[Core] Incorrectly detected TPU on a HPU-only node. #45302

Open woshiyyya opened 6 months ago

woshiyyya commented 6 months ago

What happened + What you expected to happen


The author of this PR ran a distributed training workload on an 8-HPU node; however, Ray detected an additional TPU in the cluster. This could be a bug in Ray Core's device detection.

Versions / Dependencies

nightly

Reproduction script

-

Issue Severity

Low: It annoys or frustrates me.

rynewang commented 6 months ago

@allenwang28 would you mind taking a look?

allenwang28 commented 6 months ago

Thanks for the tag! Does the HPU node have something listed at /dev/vfio or /dev/accel*?
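
For anyone who wants to check this on the node, here is a minimal Python sketch that lists whatever exists at those two locations. The function name `list_accel_candidates` and its `dev_root` parameter are made up for illustration and are not part of Ray; the script only reports which paths exist and makes no claim about how Ray interprets them.

```python
import glob
import os


def list_accel_candidates(dev_root="/dev"):
    """Return entries inside <dev_root>/vfio plus paths matching <dev_root>/accel*.

    Hypothetical helper for this thread; it is not a Ray API.
    """
    found = []

    # Entries inside the vfio directory, if that directory exists.
    vfio_dir = os.path.join(dev_root, "vfio")
    if os.path.isdir(vfio_dir):
        found.extend(
            sorted(os.path.join(vfio_dir, name) for name in os.listdir(vfio_dir))
        )

    # Anything whose name starts with "accel" directly under dev_root.
    found.extend(sorted(glob.glob(os.path.join(dev_root, "accel*"))))
    return found


if __name__ == "__main__":
    paths = list_accel_candidates()
    if paths:
        for path in paths:
            print(path)
    else:
        print("nothing found at /dev/vfio or /dev/accel*")
```

Running it on the HPU node and pasting the output here should answer the question above.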