openai / universe

Universe: a software platform for measuring and training an AI's general intelligence across the world's supply of games, websites and other applications.
https://universe.openai.com
MIT License

docker.errors.APIError: 500 Server Error: aufs failed to remove root filesystem: device or resource busy #97

tlbtlbtlb closed 7 years ago

tlbtlbtlb commented 7 years ago

Expected behavior

Universe-starter-agent learns neonrace using:

python train.py --num-workers 4 --env-id flashgames.NeonRace-v0 --log-dir /tmp/neonrace -m child

Actual behavior

This is on 0.tlb.devbox.sci.openai.org, running Ubuntu 14.04.5 LTS. When creating some Docker envs, it hits the error "port is already allocated", so it tries to shut down the container and retry with a new port. But shutting the container down fails with a "device or resource busy" error from AUFS, so the retry never happens.

...
  File "/home/tlb/openai/universe/universe/remotes/docker_remote.py", line 222, in start
    e = self._start()
  File "/home/tlb/openai/universe/universe/remotes/docker_remote.py", line 298, in _start
    self._remove()
  File "/home/tlb/openai/universe/universe/remotes/docker_remote.py", line 310, in _remove
    self.client.remove_container(container=self._container_id, force=True)
  File "/home/tlb/.conda/envs/hacking/lib/python3.5/site-packages/docker/utils/decorators.py", line 21, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/home/tlb/.conda/envs/hacking/lib/python3.5/site-packages/docker/api/container.py", line 293, in remove_container
    self._raise_for_status(res)
  File "/home/tlb/.conda/envs/hacking/lib/python3.5/site-packages/docker/client.py", line 174, in _raise_for_status
    raise errors.APIError(e, response, explanation=explanation)
docker.errors.APIError: 500 Server Error: Internal Server Error ("b'{"message":"Driver aufs failed to remove root filesystem 9bd643fa4eb0cf6583d1be3e8d4c9d032b3cf6f6f85600057fa4b7af6b1e5f54: rename /var/lib/docker/aufs/mnt/f1ef39639acb1d86ae3911c4569cf434aa58e2f6c25ec2bb3a45830591e6a26a /var/lib/docker/aufs/mnt/f1ef39639acb1d86ae3911c4569cf434aa58e2f6c25ec2bb3a45830591e6a26a-removing: device or resource busy"}'")
[2017-01-05 21:41:21,032] Killing and removing container: id=9bd643fa4eb0cf6583d1be3e8d4c9d032b3cf6f6f85600057fa4b7af6b1e5f54. (If this command errors, you can always kill all automanaged environments on this Docker daemon via: docker rm -f $(docker ps -q -a -f 'label=com.openai.automanaged=true')
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/tlb/.conda/envs/hacking/lib/python3.5/site-packages/docker/client.py", line 170, in _raise_for_status
    response.raise_for_status()
  File "/home/tlb/.conda/envs/hacking/lib/python3.5/site-packages/requests/models.py", line 844, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localunixsocket/v1.24/containers/9bd643fa4eb0cf6583d1be3e8d4c9d032b3cf6f6f85600057fa4b7af6b1e5f54?force=True&link=False&v=False

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tlb/.conda/envs/hacking/lib/python3.5/site-packages/docker/client.py", line 173, in _raise_for_status
    raise errors.NotFound(e, response, explanation=explanation)
docker.errors.NotFound: 404 Client Error: Not Found ("b'{"message":"No such container: 9bd643fa4eb0cf6583d1be3e8d4c9d032b3cf6f6f85600057fa4b7af6b1e5f54"}'")
[2017-01-05 21:41:21,108] Killing and removing container: id=9bd643fa4eb0cf6583d1be3e8d4c9d032b3cf6f6f85600057fa4b7af6b1e5f54. (If this command errors, you can always kill all automanaged environments on this Docker daemon via: docker rm -f $(docker ps -q -a -f 'label=com.openai.automanaged=true')
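
One possible way to make the cleanup step more forgiving is to retry the removal a few times and to treat "no such container" as success. The following is only a minimal sketch, not the code in docker_remote.py: `remove_with_retry`, `ContainerGone` and `RemoveFailed` are hypothetical names, with the two exception classes standing in for the `docker.errors.NotFound` and `docker.errors.APIError` seen in the traceback above. Whether retrying actually clears the AUFS "device or resource busy" condition is an assumption that would need testing.

```python
import time


class RemoveFailed(Exception):
    """Hypothetical stand-in for docker.errors.APIError."""


class ContainerGone(RemoveFailed):
    """Hypothetical stand-in for docker.errors.NotFound."""


def remove_with_retry(remove, attempts=5, delay=0.5, sleep=time.sleep):
    """Call remove() until it succeeds or the attempts run out.

    remove   -- zero-argument callable that removes the container
    attempts -- maximum number of calls to remove() (must be >= 1)
    delay    -- base delay in seconds; grows linearly per attempt
    sleep    -- injectable sleep function, so tests need not wait
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            remove()
            return attempt
        except ContainerGone:
            # Container already disappeared: nothing left to clean up.
            return attempt
        except RemoveFailed:
            if attempt == attempts:
                raise  # give up and surface the last error
            sleep(delay * attempt)
```

The function returns the number of attempts it needed, which is handy for logging. The `ContainerGone` clause comes first because it is a subclass of `RemoveFailed` in this sketch.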

Versions

Linux devbox-10-6-77-51.sci.openai.org 3.13.0-74-generic #118-Ubuntu SMP Thu Dec 17 22:52:10 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Python 3.5.2 :: Continuum Analytics, Inc.
Name: universe
Version: 0.21.1
Summary: Universe: a software platform for measuring and training an AI's general intelligence across the world's supply of games, websites and other applications.
Home-page: https://github.com/openai/universe
Author: OpenAI
Author-email: universe@openai.com
License: UNKNOWN
Location: /home/tlb/openai/universe
Requires: autobahn, docker-py, docker-pycreds, fastzbarlight, go-vncdriver, gym, Pillow, PyYAML, six, twisted, ujson
---
Name: gym
Version: 0.7.0
Summary: The OpenAI Gym: A toolkit for developing and comparing your reinforcement learning agents.
Home-page: https://github.com/openai/gym
Author: OpenAI
Author-email: gym@openai.com
License: UNKNOWN
Location: /home/tlb/openai/gym
Requires: numpy, requests, six, pyglet
---
Name: tensorflow
Version: 0.12.0rc1
Summary: TensorFlow helps the tensors flow
Home-page: http://tensorflow.org/
Author: Google Inc.
Author-email: opensource@google.com
License: Apache 2.0
Location: /home/tlb/.conda/envs/hacking/lib/python3.5/site-packages
Requires: wheel, protobuf, six, numpy
---
Name: numpy
Version: 1.11.2
Summary: NumPy: array processing for numbers, strings, records, and objects.
Home-page: http://www.numpy.org
Author: NumPy Developers
Author-email: numpy-discussion@scipy.org
License: BSD
Location: /home/tlb/.conda/envs/hacking/lib/python3.5/site-packages
Requires:
---
Name: go-vncdriver
Version: 0.4.19
Summary: UNKNOWN
Home-page: UNKNOWN
Author: UNKNOWN
Author-email: UNKNOWN
License: UNKNOWN
Location: /home/tlb/openai/go-vncdriver
Requires: numpy
---
Name: Pillow
Version: 3.4.2
Summary: Python Imaging Library (Fork)
Home-page: http://python-pillow.org
Author: Alex Clark (Fork Author)
Author-email: aclark@aclark.net
License: Standard PIL License
Location: /home/tlb/.conda/envs/hacking/lib/python3.5/site-packages
Requires:

tlbtlbtlb commented 7 years ago

In the fairly common case where the image needs to be pulled first, it fails because all 4 workers pick the same ports (5900 and 15900 in the logs below); when the pull finishes, they all race to start containers on those ports.

==> /tmp/usa-neonrace-w4/a3c.w-0.out <==
[2017-01-06 00:02:33,203] Writing logs to file: /tmp/universe-5345.log
[2017-01-06 00:02:33,216] Making new env: flashgames.NeonRace-v0
[2017-01-06 00:02:33,311] [0] Creating container: image=quay.io/openai/universe.flashgames:0.20.21. Run the same thing by hand as: docker run -p 5900:5900 -p 15900:15900 --privileged --cap-add SYS_ADMIN --ipc host quay.io/openai/universe.flashgames:0.20.21
[2017-01-06 00:02:33,313] Image quay.io/openai/universe.flashgames:0.20.21 not present locally; pulling
0.20.21: Pulling from openai/universe.flashgames

==> /tmp/usa-neonrace-w4/a3c.w-1.out <==
[2017-01-06 00:02:33,139] Writing logs to file: /tmp/universe-5346.log
[2017-01-06 00:02:33,149] Making new env: flashgames.NeonRace-v0
[2017-01-06 00:02:33,248] [0] Creating container: image=quay.io/openai/universe.flashgames:0.20.21. Run the same thing by hand as: docker run -p 5900:5900 -p 15900:15900 --cap-add SYS_ADMIN --ipc host --privileged quay.io/openai/universe.flashgames:0.20.21
[2017-01-06 00:02:33,263] Image quay.io/openai/universe.flashgames:0.20.21 not present locally; pulling
0.20.21: Pulling from openai/universe.flashgames

==> /tmp/usa-neonrace-w4/a3c.w-2.out <==
[2017-01-06 00:02:33,153] Writing logs to file: /tmp/universe-5347.log
[2017-01-06 00:02:33,165] Making new env: flashgames.NeonRace-v0
[2017-01-06 00:02:33,274] [0] Creating container: image=quay.io/openai/universe.flashgames:0.20.21. Run the same thing by hand as: docker run -p 5900:5900 -p 15900:15900 --privileged --ipc host --cap-add SYS_ADMIN quay.io/openai/universe.flashgames:0.20.21
[2017-01-06 00:02:33,276] Image quay.io/openai/universe.flashgames:0.20.21 not present locally; pulling
0.20.21: Pulling from openai/universe.flashgames

==> /tmp/usa-neonrace-w4/a3c.w-3.out <==
[2017-01-06 00:02:33,176] Writing logs to file: /tmp/universe-5348.log
[2017-01-06 00:02:33,186] Making new env: flashgames.NeonRace-v0
[2017-01-06 00:02:33,275] [0] Creating container: image=quay.io/openai/universe.flashgames:0.20.21. Run the same thing by hand as: docker run -p 5900:5900 -p 15900:15900 --ipc host --cap-add SYS_ADMIN --privileged quay.io/openai/universe.flashgames:0.20.21
[2017-01-06 00:02:33,278] Image quay.io/openai/universe.flashgames:0.20.21 not present locally; pulling
0.20.21: Pulling from openai/universe.flashgames
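
One way to avoid the race, sketched here under assumptions rather than taken from the universe code, is to derive each worker's ports from its worker index instead of having every process probe for a free port independently. `ports_for_worker` is a hypothetical helper; the base values 5900 and 15900 come from the `docker run` commands in the logs above, and the parameter names `vnc_base` and `rewarder_base` are only illustrative labels for those two ports.

```python
def ports_for_worker(worker_index, vnc_base=5900, rewarder_base=15900):
    """Return a (first_port, second_port) pair unique to this worker.

    Each worker gets the base ports shifted by its index, so workers
    started at the same moment never compete for the same host ports.
    """
    if worker_index < 0:
        raise ValueError("worker_index must be non-negative")
    return vnc_base + worker_index, rewarder_base + worker_index


if __name__ == "__main__":
    # Four workers, as in the --num-workers 4 run from this report.
    for index in range(4):
        first, second = ports_for_worker(index)
        print("worker %d: -p %d:5900 -p %d:15900" % (index, first, second))
```

This trades flexibility for determinism: it assumes the worker index is known before the container is created and that nothing else on the host already uses the shifted ports.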