Open dlwh opened 4 hours ago
cc @allenwang28
OK, this has something to do with Docker, but I don't know what. Without Docker this works fine. I'm guessing it's an env variable, but I'm not sure.
OK, so the issue is that TPUVMDockerCommandRunner only overrides the ssh_command_runner to suppress the excess TPU-XXX-head resource, but that resource sneaks in through the docker run command's explicit _with_environment_variables call, which bypasses the env handling in the SSH runner.
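To make the bypass concrete, here is a toy model of what I think is going on. It is not the Ray code: ToySSHRunner, ToyDockerRunner, bake_env, the RESOURCES key, and the TPU-v4-16-head name are all hypothetical stand-ins. The point is only that a filter applied to the env dict cannot catch a value that has already been baked into the command string.

```python
import json

HEAD = "TPU-v4-16-head"  # hypothetical resource name, for illustration only


def bake_env(cmd, env):
    """Prefix cmd with export statements (toy stand-in for the real helper)."""
    exports = "".join(
        f"export {key}='{json.dumps(value)}';" for key, value in sorted(env.items())
    )
    return exports + cmd


class ToySSHRunner:
    """Filters the head-only resource, but only from the env dict it is handed."""

    def __init__(self, worker_index):
        self.worker_index = worker_index

    def run(self, cmd, env=None):
        env = dict(env or {})
        if "RESOURCES" in env:
            resources = dict(env["RESOURCES"])
            if self.worker_index != 0:
                # Suppress the head resource on every worker except index 0.
                resources.pop(HEAD, None)
            env["RESOURCES"] = resources
        # Return the string that would be executed on the remote host.
        return bake_env(cmd, env)


class ToyDockerRunner:
    """Bakes env into the command itself, so the SSH runner receives env=None."""

    def __init__(self, ssh_runner):
        self.ssh_runner = ssh_runner

    def run(self, cmd, env=None):
        inner = bake_env(cmd, env or {})
        wrapped = "docker exec ctr /bin/bash -c " + json.dumps(inner)
        # No env is passed down, so the SSH-level filter has nothing to strip.
        return self.ssh_runner.run(wrapped)


env = {"RESOURCES": {"TPU": 4, HEAD: 1}}
via_ssh = ToySSHRunner(worker_index=1).run("ray start", env)
via_docker = ToyDockerRunner(ToySSHRunner(worker_index=1)).run("ray start", env)
```

In this model, worker 1 loses the head resource on the plain SSH path but keeps it on the Docker path, which matches the symptom. If the real code behaves the same way, the filtering would need to happen before the env is baked into the docker command.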
What happened + What you expected to happen
Somehow all TPU slice workers are getting the special TPU-{type}-head resource, even though it should only go to the actual head. I'm very confused, since the code seems pretty clear, but nevertheless you can see that we have two TPU slices but 8 TPU heads (one per worker).
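A quick way to check the head count from a driver is to filter the cluster's resource dict for names of the form TPU-{type}-head. This is a minimal sketch: the helper name tpu_head_resources is mine, and the sample dict is made-up data shaped like the symptom above, not real output from my cluster.

```python
import re

# Matches resource names of the form TPU-{type}-head.
HEAD_PATTERN = re.compile(r"^TPU-.+-head$")


def tpu_head_resources(resources):
    """Return only the TPU-{type}-head entries from a resource dict."""
    return {
        name: count
        for name, count in resources.items()
        if HEAD_PATTERN.match(name)
    }


# Made-up sample: 8 head resources where 2 would be expected for two slices.
sample = {"CPU": 960.0, "TPU": 32.0, "TPU-v4-32-head": 8.0}
heads = tpu_head_resources(sample)
```

On a live cluster the same helper can be called as tpu_head_resources(ray.cluster_resources()) after ray.init(); with two slices I would expect a count of 2, not 8.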
Versions / Dependencies
Tested against Ray 2.34 and 2.36
Reproduction script
This cluster yaml reproduces the issue
Issue Severity
Medium: It is a significant difficulty but I can work around it.