tlbtlbtlb closed this 7 years ago
Aha, I see what went wrong. Not a TensorFlow bug. I tried to kill the entire training session (4 workers plus a parameter server) but missed this one worker, which ran for about 10 more minutes before dying with this error.
The only indication of a problem in this worker's logs is a "Start master session" entry about 30 seconds after the parameter server died. Then, 10 minutes later, it crashed:
I tensorflow/core/distributed_runtime/master_session.cc:993] Start master session 28ab4c2385610d79 with config:
device_filters: "/job:ps"
device_filters: "/job:worker/task:2/cpu:0"
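For reference, those `device_filters` lines correspond to the worker building its session config so that it only talks to the parameter servers and its own devices. A rough TF 1.x sketch (the task index 2 comes from the log above; the actual universe-starter-agent code may differ):

```python
import tensorflow as tf

# Restrict this worker's session to the PS job and its own CPU device,
# matching the device_filters printed in the "Start master session" log line.
config = tf.ConfigProto(device_filters=["/job:ps", "/job:worker/task:2/cpu:0"])
```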
So not really a problem, but maybe universe-starter-agent should periodically report the status of the cluster.
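A periodic status report could be as simple as a TCP liveness probe against each cluster endpoint, which would have surfaced the dead parameter server long before the crash. A minimal sketch (the endpoint list and function names are hypothetical, not universe-starter-agent's API):

```python
import socket

def endpoint_alive(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def cluster_status(endpoints):
    """Map each 'host:port' endpoint to its reachability, for periodic logging."""
    return {f"{host}:{port}": endpoint_alive(host, port)
            for host, port in endpoints}
```

A worker could call `cluster_status` on a timer (e.g. once a minute) and log a warning as soon as the parameter server becomes unreachable, instead of silently retrying for 10 minutes.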
Actual behavior
Start universe-starter-agent with:
After about 40 minutes, one worker crashed with the error below.
This wasn't a case of restarting the worker; it had been running successfully and was on episode 21. It was running inside a fresh container (from universe-perfmon), so it can't have been a case of connecting to the wrong parameter server.
Versions