Open woshiyyya opened 6 months ago
@allenwang28 would you mind taking a look?
Thanks for the tag! Does the HPU node have something listed at /dev/vfio or /dev/accel*?
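One way to answer the question above is to list whatever exists at those paths on the HPU node. The following is a minimal sketch, not part of Ray: the helper name `list_accelerator_device_paths` and its default patterns are assumptions made for illustration, built only from the two paths mentioned in the comment.

```python
import glob
import os


def list_accelerator_device_paths(patterns=("/dev/vfio", "/dev/vfio/*", "/dev/accel*")):
    """Return the sorted, de-duplicated paths matching any of the glob patterns.

    Hypothetical helper: the default patterns only mirror the paths asked
    about in this thread and are not taken from Ray's source.
    """
    matches = set()
    for pattern in patterns:
        # glob.glob returns an empty list when nothing matches the pattern.
        matches.update(glob.glob(pattern))
    return sorted(matches)


if __name__ == "__main__":
    paths = list_accelerator_device_paths()
    if not paths:
        print("No entries found at /dev/vfio or /dev/accel*")
    for path in paths:
        kind = "directory" if os.path.isdir(path) else "device or file"
        print(f"{path} ({kind})")
```

Running this on the 8-HPU node and pasting the output into the thread would show whether any entries exist at those paths.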
What happened + What you expected to happen
The author of this PR runs a distributed training workload on an 8-HPU node; however, Ray detects an additional TPU in the cluster. This could be a device detection bug in Ray Core.
Versions / Dependencies
nightly
Reproduction script
-
Issue Severity
Low: It annoys or frustrates me.