openai / universe-starter-agent

A starter agent that can solve a number of universe environments.
MIT License
1.1k stars 318 forks

Resets grpc properly between runs (prevent random thread / process crashes) #74

Closed louiehelm closed 7 years ago

louiehelm commented 7 years ago

gRPC has a nasty habit of hanging on SIGHUP rather than exiting cleanly when TensorFlow clusters are restarted via tmux kill-session calls. This issue (or one like it) is mentioned in passing here: https://github.com/openai/universe-starter-agent/issues/44

This bug isn't ubiquitous, but it's definitely present. It's annoying when it poisons the processing environment and leaves it littered with defunct / orphaned gRPC processes that block the ports you're about to re-open when restarting the agents.

I'll eventually fix TensorFlow itself. But for now, this patch resets the environment more fully between runs so others don't end up with errors like the one below (which don't leave a trace in the logs, BTW).
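For readers who want the gist without opening the commit, here is a rough sketch of the kind of reset described above. It is not the patch itself. The helper name `cleanup_commands`, the session name `a3c`, and the port layout (ps on 12222, one port per worker after it) are assumptions for illustration; only port 12222 and the 16 workers come from the log below.

```python
def cleanup_commands(num_workers, base_port=12222, session="a3c"):
    """Return shell commands that clear leftovers from a previous run.

    Assumption: the parameter server listens on base_port and each worker
    on one of the next num_workers ports. `session` is a hypothetical
    tmux session name.
    """
    last_port = base_port + num_workers
    return [
        # Ask tmux to tear down the old session (this is what delivers SIGHUP).
        "tmux kill-session -t {}".format(session),
        # Then kill whatever still holds the cluster ports. `lsof -t` prints
        # bare PIDs; output is silenced for the case where nothing is left.
        "kill $( lsof -i:{}-{} -t ) > /dev/null 2>&1".format(base_port, last_port),
    ]


for cmd in cleanup_commands(16):
    print(cmd)
```

The function only builds the command strings, so you can inspect them before handing them to a shell.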

Please cherry pick just 3614229.

[ Sorry my PRs are gross. I'm new here. :D I know how to remove the extra 3 commits from your side if that helps. Let me know. ]

CUDA_VISIBLE_DEVICES= /usr/bin/python worker.py --log-dir /tmp/pong4579 --env-id PongDeterministic-v3 --num-workers 16 --visualise --job-name ps
[2017-02-24 17:17:08,231] Writing logs to file: /tmp/universe-21944.log
E0224 17:17:08.289240788   21944 server_chttp2.c:159]        {"created":"@1487931428.289200252","description":"No address added out of total 1 resolved","file":"external/grpc/src/core/ext/transport/chttp2/server/insecure/server_chttp2.c","file_line":125,"referenced_errors":[{"created":"@1487931428.289197827","description":"Failed to add port to server","file":"external/grpc/src/core/lib/iomgr/tcp_server_posix.c","file_line":634,"referenced_errors":[{"created":"@1487931428.289162540","description":"OS Error","errno":97,"file":"external/grpc/src/core/lib/iomgr/socket_utils_common_posix.c","file_line":260,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::]:12222"}{"created":"@1487931428.289189647","description":"Unable to configure socket","fd":15,"file":"external/grpc/src/core/lib/iomgr/tcp_server_posix.c","file_line":355,"referenced_errors":[{"created":"@1487931428.289183016","description":"OS Error","errno":98,"file":"external/grpc/src/core/lib/iomgr/tcp_server_posix.c","file_line":331,"os_error":"Address already in use","syscall":"bind"}]}],"target_address":"ipv4:0.0.0.0:12222"}]}
Traceback (most recent call last):
  File "worker.py", line 170, in <module>
    tf.app.run()
  File "/usr/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "worker.py", line 165, in main
    config=tf.ConfigProto(device_filters=["/job:ps"]))
  File "/usr/lib/python3.6/site-packages/tensorflow/python/training/server_lib.py", line 144, in __init__
    self._server_def.SerializeToString(), status)
  File "/usr/lib/python3.6/contextlib.py", line 89, in __exit__
    next(self.gen)
  File "/usr/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.UnknownError: Could not start gRPC server
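
The part of that error that matters is the errno 98 ("Address already in use") on the bind to 0.0.0.0:12222. A small pre-flight check along these lines can confirm whether a stale process still holds a port before the agents are relaunched. This is a sketch, not part of the patch; `port_in_use` is a hypothetical helper, and the port range assumes the ps sits on 12222 with 16 workers after it.

```python
import errno
import socket


def port_in_use(port, host="0.0.0.0"):
    """Return True if binding a TCP socket to (host, port) fails with EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # some other problem (e.g. permissions); don't hide it
        return False


# Assumed layout: ps on 12222, 16 workers on the ports that follow.
busy = [p for p in range(12222, 12222 + 17) if port_in_use(p)]
print("ports still held:", busy)
```

The probe socket does not set SO_REUSEADDR, so a port with connections lingering in TIME_WAIT may also be reported as busy; treat a positive result as a prompt to look at `lsof`, not as proof of a hung gRPC process.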

tlbtlbtlb commented 7 years ago

Added to master, thanks!