Closed concretevitamin closed 1 year ago
Got a similar issue during run_smoke_tests.sh.
Got googleapiclient.errors.HttpError: <HttpError 400 when requesting https://tpu.googleapis.com/v2alpha1/projects/intercloud-320520/locations/us-central1-b/nodes/ray-test-tpu-vm-zhwu-164b-15-head-822bdd52-tpu?updateMask=labels&alt=json returned "Cloud TPU received a bad request. update is not supported while in state STARTING [EID: 0x59a3b65ce30f480c]". Details: "Cloud TPU received a bad request. update is not supported while in state STARTING [EID: 0x59a3b65ce30f480c]">
@concretevitamin @Michaelvll I'm still trying to reproduce the error. If this happens to you again, could you send me the entire logs? Thanks!
Will do. Maybe try running an offending test (or a command) in a loop? Alternatively, we can try to code against the error directly even if the repro fails.
Yesterday I ran the script below, which stops and starts the TPU VM 100 times, but unfortunately I haven't encountered the error :(
Not sure if it's also related to the gcloud version. I'm using the latest (Google Cloud SDK 395.0.0).
#!/bin/bash
sky launch examples/tpu/tpuvm_mnist.yaml -c tpuvm -y
for i in {1..100}
do
sleep 60
sky stop -y tpuvm
sky status --refresh | grep tpuvm | grep STOPPED
sky start --retry-until-up -y tpuvm
done
I investigated a bit, and we may be able to add a try/except in node.py or node_provider.py for this particular error. Will continue debugging.
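A rough sketch of what such a try/except could look like, not the actual node.py or node_provider.py code. The helper names (`is_transitional_state_error`, `call_with_retry`) are hypothetical, and matching on the message text is an assumption based only on the error shown above; the real fix might inspect the HttpError status and payload instead.

```python
import time


def is_transitional_state_error(exc: Exception) -> bool:
    """Return True if the error text looks like the 'update ... while in state' error."""
    # Assumption: the message text is the only signal we match on.
    return "is not supported while in state" in str(exc)


def call_with_retry(fn, max_attempts=5, initial_delay=1.0, backoff=2.0,
                    sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on transitional-state errors.

    Any other exception, or the last failed attempt, is re-raised unchanged.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if not is_transitional_state_error(exc) or attempt == max_attempts:
                raise
            sleep(delay)
            delay *= backoff
```

The label update would then be wrapped as `call_with_retry(lambda: request.execute())`, where `request` stands for whatever object issues the TPU update call. Errors that do not match the predicate propagate immediately, so unrelated failures are not hidden by the retry.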
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This issue was closed because it has been stalled for 10 days with no activity.
Encountered this before, and saw this in the test_tpu_vm smoke test. We should guard against this error by retrying and not prematurely going into the failover loop. This message was observed outside of smoke tests as well.
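One way to avoid entering failover too early might be to wait for the node to leave its transitional state before issuing the update. The sketch below is only an illustration under assumptions: `get_state` is a hypothetical callable that returns the node's current state string, and the set of transitional states is a guess beyond STARTING, which is the only state named in the error.

```python
import time

# Assumption: only STARTING appears in the reported error; the other
# entries are guesses at states in which an update could also be rejected.
TRANSITIONAL_STATES = {"CREATING", "STARTING", "STOPPING", "RESTARTING"}


def wait_until_settled(get_state, timeout=300.0, interval=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns a non-transitional state.

    Raises TimeoutError if the node is still transitional after `timeout`
    seconds, at which point the caller can decide to fail over.
    """
    deadline = clock() + timeout
    while True:
        state = get_state()
        if state not in TRANSITIONAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"node still in state {state} after {timeout}s")
        sleep(interval)
```

With this shape, failover would be triggered only by the `TimeoutError`, not by the first rejected update.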