openai / universe

Universe: a software platform for measuring and training an AI's general intelligence across the world's supply of games, websites and other applications.
https://universe.openai.com
MIT License

Occasional timeout #99

Closed: tlbtlbtlb closed this issue 6 years ago

tlbtlbtlb commented 7 years ago

Expected behavior

Started with

/usr/local/bin/python train.py --num-workers 8 --env-id flashgames.NeonRace-v0 --log-dir /tmp/usa-flashgames.NeonRace-v0-w8-d14400 -m child

on an 8-core devbox.

Actual behavior

One of the 8 workers dies after 2 hours with the backtrace below. AFAICT, it was still getting reward messages seconds before the rewarder connection timed out.

[2017-01-06 05:52:38,680] Stats for the past 5.15s: vnc_updates_ps=19.6 n=1 reaction_time=None observation_lag=None action_lag=None processing_lag=1.75ms±1.04ms thinking_lag=15.21ms±7.55ms reward_ps=0.0 reward_total=0.0 vnc_bytes_ps[total]=475466.8 vnc_pixels_ps[total]=3748547.3 reward_lag=None rewarder_message_lag=None fps=4.27
[2017-01-06 05:52:43,709] Stats for the past 5.03s: vnc_updates_ps=20.1 n=1 reaction_time=None observation_lag=None action_lag=None processing_lag=3.43ms±1.34ms thinking_lag=4.25ms±1.69ms reward_ps=0.0 reward_total=0.0 vnc_bytes_ps[total]=528647.4 vnc_pixels_ps[total]=4237058.6 reward_lag=None rewarder_message_lag=None fps=4.57
[2017-01-06 05:52:44,799] [0] Closing rewarder connection
Exception in thread Thread-10:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/experiment/universe-starter-agent/a3c.py", line 91, in run
    self._run()
  File "/experiment/universe-starter-agent/a3c.py", line 100, in _run
    self.queue.put(next(rollout_provider), timeout=600.0)
  File "/experiment/universe-starter-agent/a3c.py", line 124, in env_runner
    state, reward, terminal, info = env.step(action.argmax())
  File "/experiment/gym/gym/core.py", line 110, in step
    observation, reward, done, info = self._step(action)
  File "/experiment/universe/universe/wrappers/vectorize.py", line 51, in _step
    observation_n, reward_n, done_n, info = self.env.step(action_n)
  File "/experiment/gym/gym/core.py", line 110, in step
    observation, reward, done, info = self._step(action)
  File "/experiment/universe/universe/vectorized/vectorize_filter.py", line 34, in _step
    o_n, r_n, d_n, i = self.env.step(action_n)
  File "/experiment/gym/gym/core.py", line 110, in step
    observation, reward, done, info = self._step(action)
  File "/experiment/universe/universe/wrappers/multiprocessing_env.py", line 62, in _step
    observation_n, reward_n, done_n, info = self.env.step(action_n)
  File "/experiment/gym/gym/core.py", line 110, in step
    observation, reward, done, info = self._step(action)
  File "/experiment/gym/gym/core.py", line 402, in _step
    return self.env.step(action)
  File "/experiment/gym/gym/core.py", line 110, in step
    observation, reward, done, info = self._step(action)
  File "/experiment/gym/gym/core.py", line 379, in _step
    observation, reward, done, info = self.env.step(action)
  File "/experiment/gym/gym/core.py", line 110, in step
    observation, reward, done, info = self._step(action)
  File "/experiment/gym/gym/core.py", line 379, in _step
    observation, reward, done, info = self.env.step(action)
  File "/experiment/gym/gym/core.py", line 110, in step
    observation, reward, done, info = self._step(action)
  File "/experiment/universe/universe/wrappers/blocking_reset.py", line 53, in _step
    new_observation_n, new_reward_n, new_done_n, new_info = self.env.step(action_n)
  File "/experiment/gym/gym/core.py", line 110, in step
    observation, reward, done, info = self._step(action)
  File "/experiment/universe/universe/wrappers/logger.py", line 59, in _step
    observation_n, reward_n, done_n, info = self.env.step(action_n)
  File "/experiment/gym/gym/core.py", line 110, in step
    observation, reward, done, info = self._step(action)
  File "/experiment/universe/universe/wrappers/vision.py", line 21, in _step
    observation_n, reward_n, done_n, info_n = self.env.step(action_n)
  File "/experiment/gym/gym/core.py", line 110, in step
    observation, reward, done, info = self._step(action)
  File "/experiment/universe/universe/wrappers/timer.py", line 20, in _step
    observation_n, reward_n, done_n, info = self.env.step(action_n)
  File "/experiment/gym/gym/core.py", line 110, in step
    observation, reward, done, info = self._step(action)
  File "/experiment/universe/universe/wrappers/render.py", line 30, in _step
    observation_n, reward_n, done_n, info_n = self.env.step(action_n)
  File "/experiment/gym/gym/core.py", line 110, in step
    observation, reward, done, info = self._step(action)
  File "/experiment/universe/universe/wrappers/throttle.py", line 117, in _step
    observation_n, reward_n, done_n, info = self._substep(action_n)
  File "/experiment/universe/universe/wrappers/throttle.py", line 132, in _substep
    observation_n, reward_n, done_n, info = self.env.step(action_n)
  File "/experiment/gym/gym/core.py", line 110, in step
    observation, reward, done, info = self._step(action)
  File "/experiment/universe/universe/envs/vnc_env.py", line 449, in _step
    self._handle_crashed_n(info_n)
  File "/experiment/universe/universe/envs/vnc_env.py", line 522, in _handle_crashed_n
    raise error.Error('{}/{} environments have crashed! Most recent error: {}'.format(len(self.crashed), self.n, errors))
universe.error.Error: 1/1 environments have crashed! Most recent error: {'0': 'Rewarder session failed: Lost connection: connection was closed uncleanly (peer dropped the TCP connection without previous WebSocket closing handshake) (clean=False code=1006)'}
:
Traceback (most recent call last):
  File "worker.py", line 122, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "worker.py", line 114, in main
    run(args, server)
  File "worker.py", line 61, in run
    trainer.process(sess)
  File "/experiment/universe-starter-agent/a3c.py", line 257, in process
    rollout = self.pull_batch_from_queue()
  File "/experiment/universe-starter-agent/a3c.py", line 241, in pull_batch_from_queue
    rollout = self.runner.queue.get(timeout=600.0)
  File "/usr/lib/python3.5/queue.py", line 172, in get
    raise Empty
queue.Empty
[2017-01-06 06:02:23,322] Killing and removing container: id=9d2476c412e51e506bd940988c2a946817a2a0d36ba6287451bcea23a085374b
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/connectionpool.py", line 385, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'
:
During handling of the above exception, another exception occurred:
:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/connectionpool.py", line 387, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.5/http/client.py", line 1197, in getresponse
    response.begin()
  File "/usr/lib/python3.5/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.5/http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.5/socket.py", line 575, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out
:
During handling of the above exception, another exception occurred:
:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/requests/adapters.py", line 403, in send
    timeout=timeout
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/connectionpool.py", line 623, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/util/retry.py", line 255, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/packages/six.py", line 310, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/connectionpool.py", line 578, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/connectionpool.py", line 389, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/connectionpool.py", line 314, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
requests.packages.urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
:
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/experiment/universe/universe/remotes/docker_remote.py", line 328, in _remove
    self.client.remove_container(container=self._container_id, force=True)
  File "/usr/local/lib/python3.5/dist-packages/docker/utils/decorators.py", line 21, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/docker/api/container.py", line 291, in remove_container
    self._url("/containers/{0}", container), params=params
  File "/usr/local/lib/python3.5/dist-packages/docker/utils/decorators.py", line 47, in inner
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/docker/client.py", line 147, in _delete
    return self.delete(url, **self._set_request_timeout(kwargs))
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 547, in delete
    return self.request('DELETE', url, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 585, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/adapters.py", line 479, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
[2017-01-06 06:03:29,057] Killing and removing container: id=9d2476c412e51e506bd940988c2a946817a2a0d36ba6287451bcea23a085374b
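
Reading the two tracebacks together: the runner thread (Thread-10) dies at 05:52:44 when `env.step` raises, but the main thread only notices about ten minutes later, when `queue.get(timeout=600.0)` raises `queue.Empty`. One possible way to surface the real error sooner is to pass the exception through the same queue. The following is only a minimal sketch of that idea, not the starter agent's actual code; `RunnerThread`, `pull_rollout` and `failing_provider` are hypothetical stand-ins, and a plain `RuntimeError` stands in for `universe.error.Error`.

```python
import queue
import threading


class RunnerThread(threading.Thread):
    """Hypothetical stand-in for the runner thread in a3c.py."""

    def __init__(self, rollout_provider):
        super().__init__(daemon=True)
        self.queue = queue.Queue(maxsize=5)
        self.rollout_provider = rollout_provider

    def run(self):
        try:
            for rollout in self.rollout_provider:
                self.queue.put(rollout, timeout=600.0)
        except Exception as exc:
            # Hand the failure to the consumer instead of dying silently,
            # so the consumer does not have to wait for its own timeout.
            self.queue.put(exc)


def pull_rollout(runner, timeout=600.0):
    """Return the next rollout, re-raising any error the runner hit."""
    item = runner.queue.get(timeout=timeout)
    if isinstance(item, Exception):
        raise item
    return item


def failing_provider():
    """Yield one rollout, then fail the way a crashed environment would."""
    yield "rollout-0"
    raise RuntimeError("1/1 environments have crashed!")


runner = RunnerThread(failing_provider())
runner.start()

first = pull_rollout(runner, timeout=5.0)
try:
    pull_rollout(runner, timeout=5.0)
    crashed = None
except RuntimeError as exc:
    crashed = str(exc)
```

With this pattern the consumer sees the original error as soon as it next reads from the queue. Whether the worker should then reset the environment or exit is a separate decision that this sketch does not address.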

Versions

Linux devbox-10-6-72-158.sci.openai.org 3.13.0-74-generic #118-Ubuntu SMP Thu Dec 17 22:52:10 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Python 3.5.2
Name: universe
Version: 0.21.1
Summary: Universe: a software platform for measuring and training an AI's general intelligence across the world's supply of games, websites and other applications.
Home-page: https://github.com/openai/universe
Author: OpenAI
Author-email: universe@openai.com
License: UNKNOWN
Location: /experiment/universe
Requires: autobahn, docker-py, docker-pycreds, fastzbarlight, go-vncdriver, gym, Pillow, PyYAML, six, twisted, ujson
---
Name: gym
Version: 0.7.0
Summary: The OpenAI Gym: A toolkit for developing and comparing your reinforcement learning agents.
Home-page: https://github.com/openai/gym
Author: OpenAI
Author-email: gym@openai.com
License: UNKNOWN
Location: /experiment/gym
Requires: numpy, requests, six, pyglet
---
Name: tensorflow
Version: 0.12.1
Summary: TensorFlow helps the tensors flow
Home-page: http://tensorflow.org/
Author: Google Inc.
Author-email: opensource@google.com
License: Apache 2.0
Location: /usr/local/lib/python3.5/dist-packages
Requires: six, wheel, protobuf, numpy
---
Name: numpy
Version: 1.11.0
Summary: NumPy: array processing for numbers, strings, records, and objects.
Home-page: http://www.numpy.org
Author: NumPy Developers
Author-email: numpy-discussion@scipy.org
License: BSD
Location: /usr/lib/python3/dist-packages
Requires:
---
Name: go-vncdriver
Version: 0.4.19
Summary: UNKNOWN
Home-page: UNKNOWN
Author: UNKNOWN
Author-email: UNKNOWN
License: UNKNOWN
Location: /experiment/go-vncdriver
Requires: numpy
---
Name: Pillow
Version: 4.0.0
Summary: Python Imaging Library (Fork)
Home-page: http://python-pillow.org
Author: Alex Clark (Fork Author)
Author-email: aclark@aclark.net
License: Standard PIL License
Location: /usr/local/lib/python3.5/dist-packages
Requires: olefile

tlbtlbtlb commented 7 years ago

Running

grep 'environments have crashed' /mnt/kube-efs/universe-perfmon/*/*.out

suggests this happens to about 10% of worker runs. Currently I see 31 instances out of 296 runs.
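
For anyone wanting to reproduce the count, here is a rough Python equivalent of the grep above. It is a sketch under one assumption: each `.out` file corresponds to one worker run. It counts files that contain the marker at least once, whereas grep prints every matching line, so the two can differ if a run logs the error more than once. The name `crash_rate` is made up for illustration.

```python
import glob

MARKER = "environments have crashed"


def crash_rate(pattern):
    """Return (crashed_runs, total_runs) for log files matching pattern."""
    paths = glob.glob(pattern)
    crashed = 0
    for path in paths:
        # errors="replace" keeps odd bytes in the logs from aborting the scan.
        with open(path, errors="replace") as f:
            if any(MARKER in line for line in f):
                crashed += 1
    return crashed, len(paths)
```

Calling `crash_rate("/mnt/kube-efs/universe-perfmon/*/*.out")` would return the two numbers quoted above if each crashed run logs the marker exactly once; 31 out of 296 works out to roughly 10.5%.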

mabirck commented 7 years ago

I saw the same behavior here!