openai / universe

Universe: a software platform for measuring and training an AI's general intelligence across the world's supply of games, websites and other applications.
https://universe.openai.com

tensorflow.python.framework.errors_impl.AbortedError: Graph handle is not found #112

Closed tlbtlbtlb closed 7 years ago

tlbtlbtlb commented 7 years ago

Actual behavior

Start universe-starter-agent with:

$ python train.py --num-workers 4 --env-id flashgames.DuskDrive-v0 \
    --log-dir /mnt/kube-efs/universe-perfmon/usa-flashgames.DuskDrive-v0-20170110-054801 \
    -m child

After about 40 minutes, one worker crashed with the error below.

This wasn't a case of restarting the worker: it had been running successfully and was on episode 21. It was running inside a fresh container (from universe-perfmon), so it can't have been a case of connecting to the wrong parameter server.
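
For context, a minimal sketch of the kind of ps/worker cluster the starter agent brings up; the hostnames, ports, and task index here are illustrative placeholders, not taken from this run. Each process runs its own tf.train.Server, and a worker's session talks to the cluster through server.target, which is why losing the parameter server job matters to every worker:

import tensorflow as tf

# Illustrative cluster spec: one parameter server and four workers.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:12222"],
    "worker": ["localhost:12223", "localhost:12224",
               "localhost:12225", "localhost:12226"],
})

# Each process starts one in-process server for its own job/task.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# The worker's session attaches to the cluster via server.target, so if the
# ps job dies, the master session has to be re-established.
sess = tf.Session(server.target)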

Exception in thread Thread-10:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1021, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1003, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.AbortedError: Graph handle is not found: 000000000000000d

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/experiment/universe-starter-agent/a3c.py", line 92, in run
    self._run()
  File "/experiment/universe-starter-agent/a3c.py", line 101, in _run
    self.queue.put(next(rollout_provider), timeout=600.0)
  File "/experiment/universe-starter-agent/a3c.py", line 139, in env_runner
    summary_writer.add_summary(summary, policy.global_step.eval())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variables.py", line 515, in eval
    return self._variable.eval(session=session)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 575, in eval
    return _eval_using_default_session(self, feed_dict, self.graph, session)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3633, in _eval_using_default_session
    return session.run(tensors, feed_dict)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.AbortedError: Graph handle is not found: 000000000000000d

[2017-01-10 06:27:56,508] Received signal 15: exiting
I tensorflow/core/distributed_runtime/master_session.cc:891] DeregisterGraph error: Aborted: Graph handle is not found: 000000000000002c. Possibly, this worker just restarted.
[2017-01-10 06:27:56,601] Killing and removing container: id=e8bf782d91fa3d6e3bb685e37ed4f9622f054bfb281f1e1e61021f1ed6798a06
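
The failing call in the traceback is the global_step eval inside env_runner. A minimal sketch, not the project's actual code, of wrapping that call so a worker whose master session has gone away fails with a clearer message; the function name and arguments below are hypothetical:

import logging
import tensorflow as tf

def add_summary_safely(summary_writer, summary, policy, sess):
    # Hypothetical wrapper around the call that raised the error above.
    try:
        step = sess.run(policy.global_step)
        summary_writer.add_summary(summary, step)
        summary_writer.flush()
    except tf.errors.AbortedError as exc:
        # "Graph handle is not found" means the master session this worker was
        # using no longer exists, so stop cleanly instead of dying in a thread.
        logging.error("Master session aborted; shutting down worker: %s", exc)
        raise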

Versions

Linux 0c0c02f2bfdb 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Python 3.5.2
Name: universe
Version: 0.21.1
Summary: Universe: a software platform for measuring and training an AI's general intelligence across the world's supply of games, websites and other applications.
Home-page: https://github.com/openai/universe
Author: OpenAI
Author-email: universe@openai.com
License: UNKNOWN
Location: /experiment/universe
Requires: autobahn, docker-py, docker-pycreds, fastzbarlight, go-vncdriver, gym, Pillow, PyYAML, six, twisted, ujson
---
Name: gym
Version: 0.7.0
Summary: The OpenAI Gym: A toolkit for developing and comparing your reinforcement learning agents.
Home-page: https://github.com/openai/gym
Author: OpenAI
Author-email: gym@openai.com
License: UNKNOWN
Location: /experiment/gym
Requires: numpy, requests, six, pyglet
---
Name: tensorflow
Version: 0.12.1
Summary: TensorFlow helps the tensors flow
Home-page: http://tensorflow.org/
Author: Google Inc.
Author-email: opensource@google.com
License: Apache 2.0
Location: /usr/local/lib/python3.5/dist-packages
Requires: protobuf, six, wheel, numpy
---
Name: numpy
Version: 1.11.0
Summary: NumPy: array processing for numbers, strings, records, and objects.
Home-page: http://www.numpy.org
Author: NumPy Developers
Author-email: numpy-discussion@scipy.org
License: BSD
Location: /usr/lib/python3/dist-packages
Requires:
---
Name: go-vncdriver
Version: 0.4.19
Summary: UNKNOWN
Home-page: UNKNOWN
Author: UNKNOWN
Author-email: UNKNOWN
License: UNKNOWN
Location: /usr/local/lib/python3.5/dist-packages
Requires: numpy
---
Name: Pillow
Version: 4.0.0
Summary: Python Imaging Library (Fork)
Home-page: http://python-pillow.org
Author: Alex Clark (Fork Author)
Author-email: aclark@aclark.net
License: Standard PIL License
Location: /usr/local/lib/python3.5/dist-packages
Requires: olefile

tlbtlbtlb commented 7 years ago

Aha, I see what went wrong. Not a TensorFlow bug. I tried to kill the entire training session (4 workers plus a parameter server) but missed this one worker, which ran for about 10 more minutes before dying with this error.

The only indication of a problem in this worker's logs is a "Start master session" line about 30 seconds after the parameter server died. Then, 10 minutes later, it crashed:

I tensorflow/core/distributed_runtime/master_session.cc:993] Start master session 28ab4c2385610d79 with config:
device_filters: "/job:ps"
device_filters: "/job:worker/task:2/cpu:0"

So not really a problem. But maybe universe-starter-agent should periodically report the status of the cluster.
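
A minimal sketch of one way such periodic status reporting could look, assuming a running session and a graph that has not yet been finalized; start_cluster_monitor and its interval are hypothetical, not an existing universe-starter-agent API:

import logging
import threading
import time
import tensorflow as tf

def start_cluster_monitor(sess, interval=60.0):
    # Hypothetical helper: pin a trivial op to the ps job so that running it
    # tells us whether the parameter server is still reachable.
    with tf.device("/job:ps/task:0"):
        ping = tf.constant(0, name="cluster_ping")

    def loop():
        while True:
            try:
                sess.run(ping)
                logging.info("cluster monitor: parameter server reachable")
            except tf.errors.OpError as exc:
                logging.error("cluster monitor: parameter server unreachable: %s", exc)
            time.sleep(interval)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread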