Closed tyoc213 closed 3 years ago
If I remember correctly, if you want to use the `XRT_WORKERS`
config, you also need to manually set up `XRT_DEVICE_MAP`
like

```
export XRT_DEVICE_MAP="CPU:0;/job:localservice/replica:0/task:0/device:XLA_CPU:0|GPU:0;/job:localservice/replica:0/task:0/device:XLA_GPU:0|GPU:1;/job:localservice/replica:0/task:0/device:XLA_GPU:1|GPU:2;/job:localservice/replica:0/task:0/device:XLA_GPU:2|GPU:3;/job:localservice/replica:0/task:0/device:XLA_GPU:3"
```
or you could simply use `GPU_NUM_DEVICES`,
and everything else (worker info) should be set up automatically; you can check https://github.com/pytorch/xla/blob/3eaee46ef679cc6a0f1f694bd0a007dbfd09c51b/.circleci/test.sh#L8
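As a sketch of the simpler route: the variable just needs to be in the environment before PyTorch/XLA initializes. Setting it from Python (equivalent to `export GPU_NUM_DEVICES=4` in the shell) would look like this:

```python
import os

# Simplest configuration: let PyTorch/XLA derive the worker info itself.
# GPU_NUM_DEVICES must be set before torch_xla is imported/initialized,
# so do it at the very top of the script.
os.environ["GPU_NUM_DEVICES"] = "4"

print(os.environ["GPU_NUM_DEVICES"])
```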
You can also check https://github.com/pytorch/xla/blob/3eaee46ef679cc6a0f1f694bd0a007dbfd09c51b/third_party/xla_client/computation_client.cc#L271 to see how we handle different ways to configure XLA.
XRT_TPU_CONFIG --> ParseEnvBasedTpuClusterConfig
TPU_NUM_DEVICES/GPU_NUM_DEVICES --> ParseEnvDeviceCounts
XRT_WORKERS + XRT_DEVICE_MAP --> ParseEnvDevices
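The selection between those three paths can be mimicked in Python roughly like this; the function names mirror the C++ ones in `computation_client.cc`, but the body is only an illustrative sketch of the dispatch order, not the actual logic:

```python
def choose_parser(env):
    """Pick a configuration path based on which env vars are set.

    Illustrative sketch of the dispatch in computation_client.cc:
    the keys checked here are the env vars listed above.
    """
    if "XRT_TPU_CONFIG" in env:
        return "ParseEnvBasedTpuClusterConfig"
    if "TPU_NUM_DEVICES" in env or "GPU_NUM_DEVICES" in env:
        return "ParseEnvDeviceCounts"
    if "XRT_WORKERS" in env and "XRT_DEVICE_MAP" in env:
        return "ParseEnvDevices"
    raise RuntimeError("Missing TPU or GPU configuration")

print(choose_parser({"GPU_NUM_DEVICES": "1"}))  # ParseEnvDeviceCounts
```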
I'm passing `XRT_WORKERS` as `localservice:0;grpc://localhost:40934`, which is weird because obviously that string doesn't have a `host_port` nor a `worker_name`, but if I make it pass with this, then it runs until:
That is, it can pass one time over `_get_devices_per_worker()`; the first time it correctly gets `num_gpus = os.environ.get(xenv.GPU_NUM_DEVICES, None)` as 1, but the second time, I don't know why, it doesn't pass there, so it ends up throwing `raise RuntimeError('Missing TPU or GPU configuration')`.

I see, the second and third times are because I call `fit` again :)... but it works if I just hardcode the value of 1; I don't know why it "deletes" the env var between calls. In the same spawn func I have something like
So I guess that is why it requests that env var 3 times...
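A sketch of the workaround implied above (the `spawn_fn` name and its body are placeholders, not the actual code from this report): re-set the variable at the top of the spawned function, so that every call sees it even if the child environment has lost it:

```python
import os

def spawn_fn(index):
    # Placeholder for the per-process entry point passed to xmp.spawn.
    # Re-setting GPU_NUM_DEVICES here guards against it being absent in
    # the environment on later calls (the symptom described above).
    os.environ.setdefault("GPU_NUM_DEVICES", "1")
    num_gpus = os.environ.get("GPU_NUM_DEVICES", None)
    if num_gpus is None:
        raise RuntimeError("Missing TPU or GPU configuration")
    return int(num_gpus)

print(spawn_fn(0))
```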