google-research / planet

Learning Latent Dynamics for Planning from Pixels
https://danijar.com/planet
Apache License 2.0

ConnectionResetError: Connection reset by peer #6

Closed danijar closed 5 years ago

danijar commented 5 years ago

Reported by @lunar24 in thread #2. Please insert your full error message below (ideally as text and not screenshots) and list your operating system version, Python version, and dm_control rendering option (GLFW, EGL, etc) that you are using.

lunar24 commented 5 years ago

@danijar Thank you very much for your reply. The complete error message is attached below. My Python version is 3.6. For dm_control rendering, I ran sudo apt-get install libglfw3 libglew2.0 and successfully ran dm_control/suite/explore.py for visualization (according to the instructions, the renderer should be GLFW). My operating system is Ubuntu 18.04. My computer also dual-boots Windows and Ubuntu, and I'm not sure whether the Ubuntu configuration caused the error. I am still learning about reinforcement learning algorithms and the related environment setup, so some of my descriptions may be inaccurate; please point out anything I got wrong. Thank you again for your paper; your algorithm has given me a lot of inspiration.

Error message
Traceback (most recent call last):
  File "/home/zzyx/planet-master/planet/training/running.py", line 194, in __iter__
    args = self._init_fn and self._init_fn(self._logdir)
  File "/home/zzyx/planet-master/planet/scripts/train.py", line 64, in start
    training.utility.collect_initial_episodes(config)
  File "/home/zzyx/planet-master/planet/training/utility.py", line 295, in collect_initial_episodes
    params.save_episode_dir)
  File "/home/zzyx/planet-master/planet/control/random_episodes.py", line 29, in random_episodes
    obs = env.reset()
  File "/home/zzyx/planet-master/planet/control/wrappers.py", line 375, in reset
    observ = self._env.reset(*args, **kwargs)
  File "/home/zzyx/planet-master/planet/control/wrappers.py", line 584, in reset
    return promise()
  File "/home/zzyx/planet-master/planet/control/wrappers.py", line 598, in _receive
    message, payload = self._conn.recv()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

WARNING:tensorflow:Worker 173e5364-d41b-4b82-b349-5f624b34d433 run 00339: Failed.
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/zzyx/planet-master/planet/scripts/train.py", line 133, in 
    tf.app.run(lambda _: main(args_), remaining)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/zzyx/planet-master/planet/scripts/train.py", line 133, in 
    tf.app.run(lambda _: main(args_), remaining)
  File "/home/zzyx/planet-master/planet/scripts/train.py", line 106, in main
    for unused_score in run:
  File "/home/zzyx/planet-master/planet/training/running.py", line 210, in __iter__
    raise e
  File "/home/zzyx/planet-master/planet/training/running.py", line 194, in __iter__
    args = self._init_fn and self._init_fn(self._logdir)
  File "/home/zzyx/planet-master/planet/scripts/train.py", line 64, in start
    training.utility.collect_initial_episodes(config)
  File "/home/zzyx/planet-master/planet/training/utility.py", line 295, in collect_initial_episodes
    params.save_episode_dir)
  File "/home/zzyx/planet-master/planet/control/random_episodes.py", line 29, in random_episodes
    obs = env.reset()
  File "/home/zzyx/planet-master/planet/control/wrappers.py", line 375, in reset
    observ = self._env.reset(*args, **kwargs)
  File "/home/zzyx/planet-master/planet/control/wrappers.py", line 584, in reset
    return promise()
  File "/home/zzyx/planet-master/planet/control/wrappers.py", line 598, in _receive
    message, payload = self._conn.recv()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
[xcb] Unknown sequence number while processing queue
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
[xcb] Aborting, sorry about that.
python3: ../../src/xcb_io.c:259: poll_for_event: Assertion `!xcb_xlib_threads_sequence_lost' failed

danijar commented 5 years ago

Thanks for providing the details! I doubt that your dual boot system causes any problems here. Could you try using the EGL rendering method for dm_control, please?

"Headless" hardware rendering (i.e. without a windowing system such as X11) requires EXT_platform_device support in the EGL driver. Recent Nvidia drivers support this. You will also need GLEW. On Debian and Ubuntu, this can be installed via sudo apt-get install libglew2.0.

PS: You can directly include long error messages and hide them using <details><summary>Summary text</summary><pre>Long error message</pre></details> at the end of your comment. I've updated your comment above to use this.

lunar24 commented 5 years ago

@danijar I'm sorry it took me so long to reply. The main reason is that I ran into many new problems when trying to use EGL rendering, including driver crashes and a black screen. To meet the requirements of EGL rendering and test PlaNet again, I installed version 390 of the NVIDIA graphics driver. (By the way, I tested GLFW rendering again, and the error messages are the same as before.) But when running with EGL, I hit a new problem (crying). Here is my error message.

Error message
zzyx@zzy-Vostro-5560:~/planet-master$ python3 '/home/zzyx/.local/lib/python3.6/site-packages/dm_control/suite/explore.py' 
Traceback (most recent call last):
  File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/suite/explore.py", line 23, in 
    from dm_control import suite
  File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/suite/__init__.py", line 28, in 
    from dm_control.suite import acrobot
  File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/suite/acrobot.py", line 24, in 
    from dm_control import mujoco
  File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/mujoco/__init__.py", line 18, in 
    from dm_control.mujoco.engine import action_spec
  File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/mujoco/engine.py", line 43, in 
    from dm_control import _render
  File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/_render/__init__.py", line 63, in 
    Renderer = import_func()  # pylint: disable=invalid-name
  File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/_render/__init__.py", line 34, in _import_egl
    from dm_control._render.pyopengl.egl_renderer import EGLContext
  File "/home/zzyx/.local/lib/python3.6/site-packages/dm_control/_render/pyopengl/egl_renderer.py", line 64, in 
    raise ImportError('Cannot initialize a headless EGL display.')
ImportError: Cannot initialize a headless EGL display.

I tried calling dm_control on its own for testing, and the error is the same. I tried to track down where the error occurs. Because I don't know the correct values, I'm not sure whether what I found is the key to the problem. In create_initialized_headless_egl_display() (/dm_control/_render/pyopengl/egl_renderer.py), my return value is EGL.EGL_NO_DISPLAY, and EGL.eglQueryDevicesEXT() returns an empty list.

Tracing further, in EGL.eglQueryDevicesEXT() (/dm_control/_render/pyopengl/egl_ext.py), the value of my num_devices = EGL.EGLint() is 0.

Maybe my description is not accurate enough. Could you tell me the correct return values of these functions, or give me some suggestions when you are free? I will also continue debugging and ask in the dm_control repository. Thanks very much.

astronautas commented 5 years ago

Thanks for trying to solve this, @danijar. I can confirm that I get the same error. I use EGL rendering.

There's one thing I've noticed while debugging the ExternalProcess class. The environment never seems to receive the last "reset" message. The training process, right after sending the reset message, tries to poll the environment process for the response to that reset message. Yet, the environment seems to be unreachable then.

Could this be a race condition between the "reset" and "close" messages? Please also verify that the environment process always writes to its own end of the pipe, and the training process to its own as well (see https://docs.python.org/2/library/multiprocessing.html, Pipes section).
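
One way to reproduce the symptom described above in isolation is sketched here. It is not PlaNet's ExternalProcess code: crashing_worker and call_reset_on_dead_worker are hypothetical names, and the sketch assumes a POSIX system where the "fork" start method is available. The worker exits before reading the pending "reset" message, as it might if rendering aborts inside the environment process, and the parent then fails on recv().

```python
import multiprocessing
import os


def crashing_worker(conn):
    # Hypothetical stand-in for an environment process that dies (for example
    # inside the renderer) before it reads the pending "reset" message.
    os._exit(1)


def call_reset_on_dead_worker():
    # Assumption: POSIX only, because the "fork" start method is requested.
    ctx = multiprocessing.get_context("fork")
    parent_conn, child_conn = ctx.Pipe()
    # Queue the message first so it is still unread when the worker dies.
    parent_conn.send(("reset", None))
    process = ctx.Process(target=crashing_worker, args=(child_conn,))
    process.start()
    child_conn.close()  # The parent keeps only its own end of the pipe.
    process.join()
    try:
        parent_conn.recv()  # No reply will ever arrive.
    except (ConnectionResetError, BrokenPipeError, EOFError) as error:
        return type(error).__name__
    finally:
        parent_conn.close()
    return "no error"


if __name__ == "__main__":
    print(call_reset_on_dead_worker())
```

If this behaves the same way on the affected machines, the "Connection reset by peer" in the tracebacks would be a consequence of the environment process dying rather than of the messages being sent in the wrong order; the exact exception type may vary by platform.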

Error message
python -m planet.scripts.train --logdir /tmp/planet --config debug   --params '{tasks: [cheetah_run]}'
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scipy/signal/_max_len_seq.py:8: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._max_len_seq_inner import _max_len_seq_inner
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scipy/signal/_upfirdn.py:36: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._upfirdn_apply import _output_len, _apply
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scipy/optimize/_trlib/__init__.py:1: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._trlib import TRLIBQuadraticSubproblem
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scipy/optimize/_numdiff.py:10: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._group_columns import group_dense, group_sparse
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scipy/signal/spectral.py:10: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._spectral import _lombscargle
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scipy/stats/_continuous_distns.py:22: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from . import _stats
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scipy/signal/_peak_finding.py:13: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._peak_finding_utils import (
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scikit_image-0.14.2-py2.7-linux-x86_64.egg/skimage/restoration/_denoise.py:6: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ..restoration._denoise_cy import _denoise_bilateral, _denoise_tv_bregman
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/PyWavelets-1.0.2-py2.7-linux-x86_64.egg/pywt/__init__.py:16: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._extensions._pywt import *
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/PyWavelets-1.0.2-py2.7-linux-x86_64.egg/pywt/__init__.py:16: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._extensions._pywt import *
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/PyWavelets-1.0.2-py2.7-linux-x86_64.egg/pywt/__init__.py:16: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._extensions._pywt import *
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/PyWavelets-1.0.2-py2.7-linux-x86_64.egg/pywt/_swt.py:8: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._extensions._swt import swt_max_level, swt as _swt, swt_axis as _swt_axis
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scikit_image-0.14.2-py2.7-linux-x86_64.egg/skimage/restoration/non_local_means.py:3: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._nl_means_denoising import (
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scikit_image-0.14.2-py2.7-linux-x86_64.egg/skimage/measure/_marching_cubes_lewiner.py:7: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from . import _marching_cubes_lewiner_cy
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scikit_image-0.14.2-py2.7-linux-x86_64.egg/skimage/measure/_marching_cubes_classic.py:4: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from . import _marching_cubes_classic_cy
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scikit_image-0.14.2-py2.7-linux-x86_64.egg/skimage/measure/_label.py:1: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._ccomp import label_cython as clabel
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scikit_image-0.14.2-py2.7-linux-x86_64.egg/skimage/measure/pnpoly.py:1: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._pnpoly import _grid_points_in_poly, _points_in_poly
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scikit_image-0.14.2-py2.7-linux-x86_64.egg/skimage/transform/hough_transform.py:2: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._hough_transform import (_hough_circle,
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scikit_image-0.14.2-py2.7-linux-x86_64.egg/skimage/draw/draw.py:5: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._draw import (_coords_inside_image, _line, _line_aa,
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scikit_image-0.14.2-py2.7-linux-x86_64.egg/skimage/transform/radon_transform.py:6: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._warps_cy import _warp_fast
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scikit_image-0.14.2-py2.7-linux-x86_64.egg/skimage/transform/radon_transform.py:7: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._radon_transform import sart_projection_update
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scikit_image-0.14.2-py2.7-linux-x86_64.egg/skimage/transform/seam_carving.py:1: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from ._seam_carving import _seam_carve_v
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scikit_image-0.14.2-py2.7-linux-x86_64.egg/skimage/filters/rank/generic.py:57: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from . import generic_cy
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scikit_image-0.14.2-py2.7-linux-x86_64.egg/skimage/filters/rank/generic.py:57: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from . import generic_cy
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scikit_image-0.14.2-py2.7-linux-x86_64.egg/skimage/filters/rank/_percentile.py:27: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from . import percentile_cy
/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/scikit_image-0.14.2-py2.7-linux-x86_64.egg/skimage/filters/rank/bilateral.py:29: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  from . import bilateral_cy
planet/scripts/train.py:48: UserWarning: 
This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

The backend was *originally* set to 'Qt5Agg' by the following code:
  File "/home/lukas/miniconda3/envs/planet/lib/python2.7/runpy.py", line 163, in _run_module_as_main
    mod_name, _Error)
  File "/home/lukas/miniconda3/envs/planet/lib/python2.7/runpy.py", line 102, in _get_module_details
    loader = get_loader(mod_name)
  File "/home/lukas/miniconda3/envs/planet/lib/python2.7/pkgutil.py", line 462, in get_loader
    return find_loader(fullname)
  File "/home/lukas/miniconda3/envs/planet/lib/python2.7/pkgutil.py", line 472, in find_loader
    for importer in iter_importers(fullname):
  File "/home/lukas/miniconda3/envs/planet/lib/python2.7/pkgutil.py", line 428, in iter_importers
    __import__(pkg)
  File "planet/__init__.py", line 19, in 
    from . import control
  File "planet/control/__init__.py", line 19, in 
    from . import planning
  File "planet/control/planning.py", line 23, in 
    from planet import tools
  File "planet/tools/__init__.py", line 22, in 
    from . import summary
  File "planet/tools/summary.py", line 19, in 
    import matplotlib.pyplot as plt
  File "/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/matplotlib/pyplot.py", line 71, in 
    from matplotlib.backends import pylab_setup
  File "/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/matplotlib/backends/__init__.py", line 16, in 
    line for line in traceback.format_stack()

  matplotlib.use('Agg')
/home/lukas/workspace/planet_src/planet/scripts/train.py:48: UserWarning: 
This call to matplotlib.use() has no effect because the backend has already
been chosen; matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

The backend was *originally* set to 'Qt5Agg' by the following code:
  File "/home/lukas/miniconda3/envs/planet/lib/python2.7/runpy.py", line 163, in _run_module_as_main
    mod_name, _Error)
  File "/home/lukas/miniconda3/envs/planet/lib/python2.7/runpy.py", line 102, in _get_module_details
    loader = get_loader(mod_name)
  File "/home/lukas/miniconda3/envs/planet/lib/python2.7/pkgutil.py", line 462, in get_loader
    return find_loader(fullname)
  File "/home/lukas/miniconda3/envs/planet/lib/python2.7/pkgutil.py", line 472, in find_loader
    for importer in iter_importers(fullname):
  File "/home/lukas/miniconda3/envs/planet/lib/python2.7/pkgutil.py", line 428, in iter_importers
    __import__(pkg)
  File "planet/__init__.py", line 19, in 
    from . import control
  File "planet/control/__init__.py", line 19, in 
    from . import planning
  File "planet/control/planning.py", line 23, in 
    from planet import tools
  File "planet/tools/__init__.py", line 22, in 
    from . import summary
  File "planet/tools/summary.py", line 19, in 
    import matplotlib.pyplot as plt
  File "/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/matplotlib/pyplot.py", line 71, in 
    from matplotlib.backends import pylab_setup
  File "/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/matplotlib/backends/__init__.py", line 16, in 
    line for line in traceback.format_stack()

  matplotlib.use('Agg')
WARNING:tensorflow:Worker cede67db-6884-4b1b-8964-af6cceb7c6a6 run 00001: Create directory '/tmp/planet/00001'.
WARNING:tensorflow:Worker cede67db-6884-4b1b-8964-af6cceb7c6a6 run 00001: Start.
INFO:tensorflow:Collecting 2+ random episodes (test-cheetah_run).
I0318 22:51:10.230741 139822648039168 __init__.py:34] MuJoCo library version is: 200
INFO:tensorflow:Recorded episode 20190318T225111-1014ad7a04e148a68a03aefeb761db3a.
INFO:tensorflow:Recorded episode 20190318T225111-6f7aac6341d6416ba4c3a77a231384de.
INFO:tensorflow:Collecting 2+ random episodes (train-cheetah_run).
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
      after 125 requests (125 known processed) with 0 events remaining.
WARNING:tensorflow:Worker cede67db-6884-4b1b-8964-af6cceb7c6a6 run 00001: Exception:
Traceback (most recent call last):
  File "planet/training/running.py", line 194, in __iter__
    args = self._init_fn and self._init_fn(self._logdir)
  File "/home/lukas/workspace/planet_src/planet/scripts/train.py", line 64, in start
    training.utility.collect_initial_episodes(config)
  File "planet/training/utility.py", line 295, in collect_initial_episodes
    params.save_episode_dir)
  File "planet/control/random_episodes.py", line 29, in random_episodes
    obs = env.reset()
  File "planet/control/wrappers.py", line 375, in reset
    observ = self._env.reset(*args, **kwargs)
  File "planet/control/wrappers.py", line 584, in reset
    return promise()
  File "planet/control/wrappers.py", line 598, in _receive
    message, payload = self._conn.recv()
IOError: [Errno 104] Connection reset by peer

WARNING:tensorflow:Worker cede67db-6884-4b1b-8964-af6cceb7c6a6 run 00001: Failed.
Traceback (most recent call last):
  File "/home/lukas/miniconda3/envs/planet/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/lukas/miniconda3/envs/planet/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/lukas/workspace/planet_src/planet/scripts/train.py", line 133, in 
    tf.app.run(lambda _: main(args_), remaining)
  File "/home/lukas/miniconda3/envs/planet/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/lukas/workspace/planet_src/planet/scripts/train.py", line 133, in 
    tf.app.run(lambda _: main(args_), remaining)
  File "/home/lukas/workspace/planet_src/planet/scripts/train.py", line 106, in main
    for unused_score in run:
  File "planet/training/running.py", line 210, in __iter__
    raise e
IOError: [Errno 104] Connection reset by peer
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
      after 177 requests (177 known processed) with 12 events remaining.

doralune commented 5 years ago

I hope this information gives a clue. I got different results depending on the rendering option (the environment variable MUJOCO_GL) when running python -m planet.scripts.train.

MUJOCO_GL=glfw    --> ConnectionResetError: [Errno 104] Connection reset by peer
MUJOCO_GL=egl     --> ImportError: Cannot initialize a headless EGL display.
MUJOCO_GL=osmesa  --> working on my environment

my environment

- Ubuntu 18.04
- python 3.6.7
- tensorflow-gpu==1.12.0
- tensorflow-probability==0.5.0
- dm_control (using mujoco200_linux when installation) 
- mujoco_py (using mjpro150_linux when installation)
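
A minimal sketch of selecting the backend from Python instead of the shell follows. The helper select_mujoco_backend is hypothetical, and the sketch assumes that dm_control picks its renderer from MUJOCO_GL when it is first imported, so the variable has to be set before that import.

```python
import os

# Backends named in this thread; other dm_control versions may accept more.
KNOWN_BACKENDS = ("glfw", "egl", "osmesa")


def select_mujoco_backend(backend="osmesa"):
    """Set MUJOCO_GL for this process and any child processes it starts."""
    if backend not in KNOWN_BACKENDS:
        raise ValueError("unknown MUJOCO_GL backend: %r" % (backend,))
    os.environ["MUJOCO_GL"] = backend
    return backend


select_mujoco_backend("osmesa")
# Import dm_control only after this point, for example:
# from dm_control import suite
```

Setting the variable on the command line, as in MUJOCO_GL=osmesa python -m planet.scripts.train, should have the same effect without any code change.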

danijar commented 5 years ago

@lunar24 Sorry to hear! At least we can narrow down your problem to getting headless rendering to work with dm_control. PlaNet collects data from within the TensorFlow graph, so I think headless rendering will be necessary.

Besides this, you might be able to get better support for this at the dm_control repository -- I'm not sure how to debug the headless rendering and unfortunately I don't have bandwidth to look into the dm_control code as you suggested. Have you tried asking there?

@doralune Thank you for investigating this. I wouldn't expect the GLFW renderer to work in sub processes or threads. EGL works for me but might be more difficult to set up as it requires a newer graphics driver according to the dm_control readme. Mesa should be easier to set up and handle parallel access but could be slower since it's rendering on the CPU.

lunar24 commented 5 years ago

Hello, a good day begins with debugging and ends with a new bug.

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:

/usr/lib/python3.6/runpy.py:125: RuntimeWarning: 'planet.scripts.train' found in sys.modules after import of package 'planet.scripts', but prior to execution of 'planet.scripts.train'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
WARNING:tensorflow:Worker b749528b-6816-4f45-b096-3708f96cde3d run 00001: Create directory '/home/zzyx/testrun/00001'.
WARNING:tensorflow:Worker b749528b-6816-4f45-b096-3708f96cde3d run 00001: Start.
INFO:tensorflow:Collecting 5+ random episodes (test-cheetah_run).
I0321 04:18:12.912200 139655267034944 acceleratesupport.py:13] OpenGL_accelerate module loaded
I0321 04:18:12.919322 139655267034944 arraydatatype.py:270] Using accelerated ArrayDatatype
I0321 04:18:13.281697 139655267034944 __init__.py:34] MuJoCo library version is: 200
INFO:tensorflow:Recorded episode 20190321T041820-ba7bcbd480bd40c08439c0d8c06af1dc.
INFO:tensorflow:Recorded episode 20190321T041827-ecf060c6c94d4cdcbe0d7dbeb9b604ea.
INFO:tensorflow:Recorded episode 20190321T041834-aceb5dccabc64075a0554eb8c243005a.
INFO:tensorflow:Recorded episode 20190321T041841-30e62906fd2346a9af2feaa34f7c6065.
INFO:tensorflow:Recorded episode 20190321T041848-4e906ce5a91a477d9a5be5c152943f09.
INFO:tensorflow:Collecting 5+ random episodes (train-cheetah_run).
INFO:tensorflow:Recorded episode 20190321T041856-7b875a6753cc4dbd9c9d4b7f8344c36f.
INFO:tensorflow:Recorded episode 20190321T041903-9e5f8326ab4940b7a32aa4ca4d183bd1.
INFO:tensorflow:Recorded episode 20190321T041910-468be8b33a024e3791b5260bb7d5e0fe.
INFO:tensorflow:Recorded episode 20190321T041918-a5ce1ca5ae5c48009fdc2da996a90c34.
INFO:tensorflow:Recorded episode 20190321T041925-1b17525948144df9b69b58a02f4d9bc0.
WARNING:tensorflow:From /home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use tf.py_function, which takes a python function which manipulates tf eager tensors instead of numpy arrays. It's easy to convert a tf eager tensor to an ndarray (just call tensor.numpy()) but having access to eager tensors means tf.py_functions can use accelerators such as GPUs as well as being differentiable using a gradient tape.

WARNING:tensorflow:From /home/zzyx/planet-master/planet/tools/preprocess.py:24: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
INFO:tensorflow:Start a new run and write summaries and checkpoints to /home/zzyx/testrun/00001.
WARNING:tensorflow:From /home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
INFO:tensorflow:Build TensorFlow compute graph.
WARNING:tensorflow:From /home/zzyx/planet-master/planet/networks/conv_ha.py:30: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.conv2d instead.
WARNING:tensorflow:From /home/zzyx/planet-master/planet/networks/conv_ha.py:34: flatten (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.flatten instead.
WARNING:tensorflow:From /home/zzyx/planet-master/planet/tools/overshooting.py:80: dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use keras.layers.RNN(cell), which is equivalent to this API
WARNING:tensorflow:From /home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py:626: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/zzyx/planet-master/planet/models/rssm.py:94: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.

[<tensorflow.python.ops.template.Template object at 0x7f03cc3e3c50>, <tensorflow.python.ops.template.Template object at 0x7f03cc3e3940>, <tensorflow.contrib.rnn.python.ops.gru_ops.GRUBlockCell object at 0x7f03cc388da0>]
WARNING:tensorflow:Worker b749528b-6816-4f45-b096-3708f96cde3d run 00001: Exception:
Traceback (most recent call last):
  File "/home/zzyx/planet-master/planet/training/running.py", line 199, in __iter__
    for value in self._process_fn(self._logdir, args):
  File "/home/zzyx/planet-master/planet/scripts/train.py", line 91, in process
    training.define_model, dataset, logdir, config):
  File "/home/zzyx/planet-master/planet/training/utility.py", line 160, in train
    score, summary = model_fn(data, trainer, config)
  File "/home/zzyx/planet-master/planet/training/define_model.py", line 71, in define_model
    config.overshooting + 1)
  File "/home/zzyx/planet-master/planet/tools/overshooting.py", line 80, in overshooting
    swap_memory=True)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 671, in dynamic_rnn
    dtype=dtype)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 879, in _dynamic_rnn_loop
    swap_memory=swap_memory)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3556, in while_loop
    return_same_structure)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3087, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3022, in _BuildLoop
    body_result = body(packed_vars_for_body)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3525, in <lambda>
    body = lambda i, lv: (i + 1, orig_body(*lv))
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 845, in _time_step
    skip_conditionals=True)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 276, in _rnn_step
    new_output, new_state = call_cell()
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 833, in <lambda>
    call_cell = lambda: cell(input_t, state)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 234, in __call__
    return super(RNNCell, self).__call__(inputs, state)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 534, in __call__
    _add_elements_to_collection(self.updates, ops.GraphKeys.UPDATE_OPS)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 653, in updates
    return self._updates + self._gather_children_attribute('updates')
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1648, in _gather_children_attribute
    getattr(layer, attribute) for layer in self._layers))
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1648, in <genexpr>
    getattr(layer, attribute) for layer in self._layers))
AttributeError: 'Template' object has no attribute 'updates'

WARNING:tensorflow:Worker b749528b-6816-4f45-b096-3708f96cde3d run 00001: Failed.
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/zzyx/planet-master/planet/scripts/train.py", line 133, in <module>
    tf.app.run(lambda _: main(args_), remaining)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    sys.exit(main(argv))
  File "/home/zzyx/planet-master/planet/scripts/train.py", line 133, in <lambda>
    tf.app.run(lambda _: main(args_), remaining)
  File "/home/zzyx/planet-master/planet/scripts/train.py", line 106, in main
    for unused_score in run:
  File "/home/zzyx/planet-master/planet/training/running.py", line 210, in __iter__
    raise e
  File "/home/zzyx/planet-master/planet/training/running.py", line 199, in __iter__
    for value in self._process_fn(self._logdir, args):
  File "/home/zzyx/planet-master/planet/scripts/train.py", line 91, in process
    training.define_model, dataset, logdir, config):
  File "/home/zzyx/planet-master/planet/training/utility.py", line 160, in train
    score, summary = model_fn(data, trainer, config)
  File "/home/zzyx/planet-master/planet/training/define_model.py", line 71, in define_model
    config.overshooting + 1)
  File "/home/zzyx/planet-master/planet/tools/overshooting.py", line 80, in overshooting
    swap_memory=True)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 671, in dynamic_rnn
    dtype=dtype)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 879, in _dynamic_rnn_loop
    swap_memory=swap_memory)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3556, in while_loop
    return_same_structure)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3087, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3022, in _BuildLoop
    body_result = body(packed_vars_for_body)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3525, in <lambda>
    body = lambda i, lv: (i + 1, orig_body(*lv))
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 845, in _time_step
    skip_conditionals=True)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 276, in _rnn_step
    new_output, new_state = call_cell()
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 833, in <lambda>
    call_cell = lambda: cell(input_t, state)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 234, in __call__
    return super(RNNCell, self).__call__(inputs, state)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 534, in __call__
    _add_elements_to_collection(self.updates, ops.GraphKeys.UPDATE_OPS)
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 653, in updates
    return self._updates + self._gather_children_attribute('updates')
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1648, in _gather_children_attribute
    getattr(layer, attribute) for layer in self._layers))
  File "/home/zzyx/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1648, in <genexpr>
    getattr(layer, attribute) for layer in self._layers))
AttributeError: 'Template' object has no attribute 'updates'

While debugging further, I have also come to understand the first two errors better. The first two errors occur in the rendering process.
The third error may lie in the TensorFlow configuration, because with OSMesa rendering the multiprocess code runs well and correctly outputs the debug prints I added. I have also tried to trace the new error. The key to the problem seems to be that the inherited variable 'layers' is empty, so there is no 'updates' variable. I tried reinstalling TF to rule out installation errors or file corruption, but it didn't help. I searched online for a solution, but I didn't find the right one.
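For readers following the traceback, the failing frame calls `getattr(layer, attribute)` on every object in `self._layers`. The sketch below imitates that pattern in plain Python, without TensorFlow. The classes `Template` and `Cell` and the function `gather_children_attribute` are hypothetical stand-ins written for illustration, not TensorFlow code. It only shows how a single child without an `updates` attribute is enough to raise the same message.

```python
class Template:
    """Hypothetical stand-in for an object without an `updates` attribute."""


class Cell:
    """Hypothetical stand-in for a layer that does expose `updates`."""
    updates = []


def gather_children_attribute(layers, attribute):
    # Imitates the shape of the failing frame: getattr on every tracked child.
    collected = []
    for layer in layers:
        collected.extend(getattr(layer, attribute))
    return collected


# Same composition as the list printed before the traceback:
# two template-like objects followed by one cell-like object.
layers = [Template(), Template(), Cell()]

try:
    gather_children_attribute(layers, "updates")
    message = None
except AttributeError as error:
    message = str(error)

print(message)
```

Under this reading, the lookup fails on the first child that lacks the attribute, regardless of what the other children provide.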

I noticed many warnings at runtime saying that some functions will be replaced in future versions. I don't know whether this error is due to GPU problems, the TF version, or something else. These are the versions I am running.

Version
Keras==2.2.4
Keras-Applications==1.0.7
Keras-Preprocessing==1.0.9
tensorboard==1.13.1
tensorflow==1.13.1
tensorflow-estimator==1.13.0
tensorflow-gpu==1.13.1
tensorflow-probability==0.6.0
Python 3.6.7

The only good news is that I have deepened my understanding of your paper. Thank you again for your patient help. Best regards.

danijar commented 5 years ago

I think you're getting close to running the code. I haven't updated the code for the newest TensorFlow release, which introduces a few breaking changes. Would it be possible for you to test with TensorFlow 1.12.0, and TensorFlow Probability 0.5.0?
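A small sketch of how the suggested pins could be compared against an installed environment. The pinned versions come from the suggestion above and the "reported" versions from the list earlier in this thread. The helper `find_mismatches` and the dictionaries are hypothetical names introduced only for this example.

```python
# Versions suggested in this thread for the code as it stood at the time.
REQUIRED = {
    "tensorflow": "1.12.0",
    "tensorflow-probability": "0.5.0",
}


def find_mismatches(installed, required=REQUIRED):
    """Return {package: (installed_version, required_version)} per mismatch."""
    mismatches = {}
    for name, wanted in required.items():
        found = installed.get(name)  # None when the package is missing.
        if found != wanted:
            mismatches[name] = (found, wanted)
    return mismatches


# Versions reported earlier in the thread.
reported = {
    "tensorflow": "1.13.1",
    "tensorflow-probability": "0.6.0",
}

for name, (found, wanted) in find_mismatches(reported).items():
    print("{}: installed {}, suggested {}".format(name, found, wanted))
```

The corresponding downgrade would typically be done with pip, for example `pip install tensorflow==1.12.0 tensorflow-probability==0.5.0`, adjusted for the GPU package if one is installed.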

To summarize the findings so far:

I'll also include this information in the readme.

lunar24 commented 5 years ago

@danijar I can't wait to tell you the good news. After replacing the versions of tensorflow and tensorflow-probability, the code ran successfully. Although CPU rendering is slow, this is certainly good news. I will also try other parameters in the model. The program is still running. If anything else comes up, I will let you know right away. I really appreciate your help ^_^. I sincerely hope that I can learn more from you in the future. Best regards!

astronautas commented 5 years ago

@danijar Good news, I've managed to make it run :). It did not work on Ubuntu 16.04, yet on 18.04 it ran fine. I think it would be good to update the readme with a suggestion to use Ubuntu 18.04. I am not exactly sure whether this was an OS issue or some clash of dependencies. That said, all of us here who got this running use 18.04.

I also want to point out that dm_control cannot be installed through setup.py (there is currently no such PyPI package). It has to be installed from the official dm_control repository.

danijar commented 5 years ago

@lunar24 That is great to hear! I'm closing this thread but feel free to reopen if the same error shows up again or otherwise to open a new issue.

@astronautas Good idea, I've added the hint.

danijar commented 5 years ago

@astronautas @lunar24 @doralune @JamesLuoau I've updated the code to the newest version of TensorFlow and added an option to isolate environments into threads instead of processes. At least on my workstation, all three rendering options work now under default settings. If you are still running into problems, please let me know.
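As a rough illustration of what isolating an environment in a thread can look like, here is a minimal sketch. It is not the repository's implementation: `CounterEnv` and `ThreadedEnv` are hypothetical classes written for this example. The environment is built and used on one dedicated worker thread, and calls are passed through in-memory queues, so there is no inter-process connection involved.

```python
import queue
import threading


class CounterEnv:
    """Hypothetical stand-in environment; not part of planet or dm_control."""

    def __init__(self):
        self._total = 0

    def reset(self):
        self._total = 0
        return self._total

    def step(self, action):
        self._total += action
        return self._total


class ThreadedEnv:
    """Runs every environment call on one dedicated worker thread."""

    def __init__(self, constructor):
        self._requests = queue.Queue()
        self._thread = threading.Thread(
            target=self._worker, args=(constructor,), daemon=True)
        self._thread.start()

    def _worker(self, constructor):
        env = constructor()  # Built inside the thread that will use it.
        while True:
            name, args, reply = self._requests.get()
            if name is None:  # Sentinel sent by close().
                break
            try:
                reply.put((True, getattr(env, name)(*args)))
            except Exception as error:  # Forward failures to the caller.
                reply.put((False, error))

    def call(self, name, *args):
        reply = queue.Queue(maxsize=1)
        self._requests.put((name, args, reply))
        ok, payload = reply.get()
        if not ok:
            raise payload
        return payload

    def close(self):
        self._requests.put((None, (), None))
        self._thread.join()


env = ThreadedEnv(CounterEnv)
first = env.call("reset")
second = env.call("step", 2)
third = env.call("step", 3)
env.close()
print(first, second, third)
```

The trade-off in a design like this is that threads share one Python interpreter, so a crash or blocking call in the environment is less contained than with a separate process.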