google-research / planet

Learning Latent Dynamics for Planning from Pixels
https://danijar.com/planet
Apache License 2.0

Multiprocessing EOF error #2

Closed jluo-bgl closed 5 years ago

jluo-bgl commented 5 years ago

Hi, I'm trying to replicate your results, but the code is not running well on my Python 3.5 + macOS setup. For example, I got an EOF error from multiprocessing. I have fixed several errors of this kind, but I'm not sure how many more I'll hit, so knowing your tested environment would help me a lot. Thanks.

danijar commented 5 years ago

Hi @JamesLuoau, the code works for us on Debian with Python 2.7 and Python 3.5. I'm not sure why there would be a multiprocessing error here, though -- the code is not parallelized, and TensorFlow only uses threads as far as I know. Maybe try commenting out the ExternalProcess class in wrappers.py and make sure that it isn't used.

2877992943 commented 5 years ago

scripts/tasks.py

  from dm_control import suite
  def env_ctor():
    env = control.wrappers.DeepMindWrapper(suite.load(domain, task), (64, 64))
    env = control.wrappers.ActionRepeat(env, action_repeat)
    env = control.wrappers.LimitDuration(env, max_length)
    env = control.wrappers.PixelObservations(env, (64, 64), np.uint8, 'image')
    env = control.wrappers.ConvertTo32Bit(env)
    return env
  # env = control.wrappers.ExternalProcess(env_ctor)  # change here
  env = env_ctor()  # construct the environment in-process instead
  return env

This seems to work on macOS.

However, I see "nan" in the log output:

INFO:tensorflow:Graph contains 5144438 trainable variables.
2019-02-28 15:31:46.425061: E tensorflow/core/common_runtime/session.cc:75] Not found: No session factory registered for the given session options: {target: "local" config: gpu_options { allow_growth: true }} Registered factories are {DIRECT_SESSION, GRPC_SESSION}.
2019-02-28 15:31:46.425243: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
INFO:tensorflow:
--------------------------------------------------
Epoch 1 phase train (phase step 0, global step 0).
step/score/loss/zs_entropy/zs_divergence =  [0, nan, 11855.6396, 35.8041, 3.36079955]

danijar commented 5 years ago

Yes, that's correct. The nan is intentional and shows up as the mean planning score for steps in which no planning simulation happens.
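
As a minimal illustration (my sketch, not the PlaNet code itself): if no planning simulations ran during a phase, there are no scores to average, and the mean of an empty array comes out as nan:

import numpy as np

scores = np.array([])   # no planning steps contributed a score
print(np.mean(scores))  # nan (NumPy also warns about the empty slice)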

jluo-bgl commented 5 years ago

Hi, I can confirm that TensorFlow 1.13 and TensorFlow Probability 0.6.0 are not working: the script tools/test_overshooting.py does not pass, and an exception is thrown: AttributeError: 'Template' object has no attribute 'updates'.

However, if I downgrade to tensorflow 1.12.0 and tensorflow-probability 0.5.0, test_overshooting.py passes.

Could you please provide a requirements.txt file for your environment? Thanks a lot.

I'm now getting the error below; your help would be appreciated.

UnknownError (see above for traceback): RuntimeError: Cannot make context <dm_control._render.glfw_renderer.GLFWContext object at 0x13574b630> current on thread <_DummyThread(Dummy-5, started daemon 123145467191296)>: this context is already current on another thread <_DummyThread(Dummy-4, started daemon 123145466118144)>.
Traceback (most recent call last):

  File "/Users/user_name/anaconda/envs/planet/lib/python3.5/site-packages/tensorflow/python/ops/script_ops.py", line 206, in __call__
    ret = func(*args)

  File "/Users/user_name/git/planet/planet/control/in_graph_batch_env.py", line 95, in <lambda>
    lambda a: self._batch_env.step(a)[:3], [action],

  File "/Users/user_name/git/planet/planet/control/batch_env.py", line 86, in step
    for env, action in zip(self._envs, actions)]

  File "/Users/user_name/git/planet/planet/control/batch_env.py", line 86, in <listcomp>
    for env, action in zip(self._envs, actions)]

  File "/Users/user_name/git/planet/planet/control/wrappers.py", line 90, in step
    obs, reward, done, info = self._env.step(action)

  File "/Users/user_name/git/planet/planet/control/wrappers.py", line 367, in step
    transition = self._env.step(action, *args, **kwargs)

  File "/Users/user_name/git/planet/planet/control/wrappers.py", line 445, in step
    observ, reward, done, info = self._env.step(action)

  File "/Users/user_name/git/planet/planet/control/wrappers.py", line 156, in step
    obs[self._key] = self._render_image()

  File "/Users/user_name/git/planet/planet/control/wrappers.py", line 165, in _render_image
    image = self._env.render('rgb_array')

  File "/Users/user_name/git/planet/planet/control/wrappers.py", line 261, in render
    *self._render_size, camera_id=self._camera_id)

  File "/Users/user_name/anaconda/envs/planet/lib/python3.5/site-packages/dm_control/mujoco/engine.py", line 171, in render
    physics=self, height=height, width=width, camera_id=camera_id)

  File "/Users/user_name/anaconda/envs/planet/lib/python3.5/site-packages/dm_control/mujoco/engine.py", line 574, in __init__
    with self._physics.contexts.gl.make_current() as ctx:

  File "/Users/user_name/anaconda/envs/planet/lib/python3.5/contextlib.py", line 59, in __enter__
    return next(self.gen)

  File "/Users/user_name/anaconda/envs/planet/lib/python3.5/site-packages/dm_control/_render/base.py", line 116, in make_current
    _CURRENT_THREAD_FOR_CONTEXT[id(self)]))

RuntimeError: Cannot make context <dm_control._render.glfw_renderer.GLFWContext object at 0x13574b630> current on thread <_DummyThread(Dummy-5, started daemon 123145467191296)>: this context is already current on another thread <_DummyThread(Dummy-4, started daemon 123145466118144)>.

     [[node graph/collection/should_collect_cartpole_balance/simulate-1/train-cartpole_balance-cem-12/scan/while/simulate/environment/simulate/step (defined at /Users/user_name/git/planet/planet/control/in_graph_batch_env.py:96)  = PyFunc[Tin=[DT_FLOAT], Tout=[DT_UINT8, DT_FLOAT, DT_BOOL], token="pyfunc_7", _device="/job:localhost/replica:0/task:0/device:CPU:0"](graph/collection/should_collect_cartpole_balance/simulate-1/train-cartpole_balance-cem-12/scan/while/simulate/Identity_5)]]
danijar commented 5 years ago

Hi @JamesLuoau, these both sound like issues with other libraries. Please ask about the AttributeError on the TensorFlow Probability repo and about the multi-threaded rendering error on the dm_control repo. Neither of these happens for me with Python 3.5, TensorFlow 1.12.0, and TensorFlow Probability 0.5.0. If many people are experiencing this, please upvote the comment above this one.
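
For anyone who wants to pin a working environment, a minimal requirements.txt sketch based only on the versions named in this thread (dm_control is left unpinned because no version is stated here):

tensorflow==1.12.0
tensorflow-probability==0.5.0
dm_control  # no version stated in this thread; install per its own README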

astronautas commented 5 years ago

@danijar, thanks to you and your team for such an interesting contribution to RL!

I am planning to scale up this implementation to a multi-agent environment to see how well it performs. I am facing the same problem as @JamesLuoau, though. Here's an excerpt from the logs:

Traceback (most recent call last):

  File "/home/username/miniconda3/envs/planet/lib/python2.7/site-packages/tensorflow/python/ops/script_ops.py", line 206, in __call__
    ret = func(*args)

  File "planet/control/in_graph_batch_env.py", line 95, in <lambda>
    lambda a: self._batch_env.step(a)[:3], [action],

  File "planet/control/batch_env.py", line 86, in step
    for env, action in zip(self._envs, actions)]

  File "planet/control/wrappers.py", line 90, in step
    obs, reward, done, info = self._env.step(action)

  File "planet/control/wrappers.py", line 367, in step
    transition = self._env.step(action, *args, **kwargs)

  File "planet/control/wrappers.py", line 445, in step
    observ, reward, done, info = self._env.step(action)

  File "planet/control/wrappers.py", line 156, in step
    obs[self._key] = self._render_image()

  File "planet/control/wrappers.py", line 165, in _render_image
    image = self._env.render('rgb_array')

  File "planet/control/wrappers.py", line 261, in render
    *self._render_size, camera_id=self._camera_id)

  File "/home/username/miniconda3/envs/planet/lib/python2.7/site-packages/dm_control/mujoco/engine.py", line 171, in render
    physics=self, height=height, width=width, camera_id=camera_id)

  File "/home/username/miniconda3/envs/planet/lib/python2.7/site-packages/dm_control/mujoco/engine.py", line 574, in __init__
    with self._physics.contexts.gl.make_current() as ctx:

  File "/home/username/miniconda3/envs/planet/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()

  File "/home/username/miniconda3/envs/planet/lib/python2.7/site-packages/dm_control/_render/base.py", line 116, in make_current
    _CURRENT_THREAD_FOR_CONTEXT[id(self)]))

RuntimeError: Cannot make context <dm_control._render.glfw_renderer.GLFWContext object at 0x7f7b417a0550> current on thread <_DummyThread(Dummy-5, started daemon 140166375134976)>: this context is already current on another thread <_DummyThread(Dummy-4, started daemon 140166358349568)>.

     [[node graph/collection/should_collect_cheetah_run/simulate-1/train-cheetah_run-cem-12/scan/while/simulate/environment/simulate/step (defined at planet/control/in_graph_batch_env.py:96)  = PyFunc[Tin=[DT_FLOAT], Tout=[DT_UINT8, DT_FLOAT, DT_BOOL], token="pyfunc_7", _device="/job:localhost/replica:0/task:0/device:CPU:0"](graph/collection/should_collect_cheetah_run/simulate-1/train-cheetah_run-cem-12/scan/while/simulate/Identity_5/_847)]]
     [[{{node graph/summaries/general/sub/ReadVariableOp/_441}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_4356_graph/summaries/general/sub/ReadVariableOp", tensor_type=DT_DOUBLE, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

It seems the visualizations cannot start. With the debug configuration, it runs until the 15th step and then crashes. I am running with --config debug, so this kicks in when testing starts. Maybe it has something to do with the workers? Or maybe it's the transition from train to test?

Could you specify these things:

Thank you :)

@JamesLuoau have you managed to fix this?

danijar commented 5 years ago

Thanks for letting me know. I will look into this but it will take a couple of days before I get to it. For now, I think everything works using the previous version of TensorFlow and TensorFlow Probability. I mentioned the versions for this above. I'm using the egl rendering option for dm_control.
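
For context, dm_control chooses its render backend through the MUJOCO_GL environment variable; here is a minimal sketch of forcing EGL (my assumption of the setup described above -- note the variable must be set before dm_control is first imported):

import os

# Must be set before the first dm_control import. 'egl' needs reasonably
# recent NVIDIA drivers; 'osmesa' is a slower software-rendering fallback.
os.environ['MUJOCO_GL'] = 'egl'

from dm_control import suite

env = suite.load('cheetah', 'run')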

astronautas commented 5 years ago

> Thanks for letting me know. I will look into this but it will take a couple of days before I get to it. For now, I think everything works using the previous version of TensorFlow and TensorFlow Probability. I mentioned the versions for this above. I'm using the egl rendering option for dm_control.

Thanks @danijar 🥇. I use TensorFlow 1.12.0 and TF Probability 0.5.0 as well. I am starting to think this might be an OS configuration issue, who knows 🤷‍♂️. Please share your MuJoCo Pro, dm_control, and mujoco-py versions too, since the mujoco-py repo states that it needs MuJoCo Pro 1.5.0, yet dm_control depends on version 2.0.0. Knowing your NVIDIA driver version would be nice as well, as EGL needs recent NVIDIA drivers to work properly.

jluo-bgl commented 5 years ago

Hi @astronautas, I haven't found a good way to run it yet. As you mentioned, I have to use MuJoCo 2.0.0.

danijar commented 5 years ago

@astronautas and @JamesLuoau To debug this further, could you please confirm that you can create a dm_control environment and call render on it (outside of the PlaNet code)? I have both mjpro150 and mjpro200_linux installed on my machine, but I think only the latter is used by dm_control. The PlaNet code is independent of the dm_control render option and should work with all of them as long as they support multi-threading -- I've used multiple options at some point.
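
A minimal single-threaded check along these lines (a sketch, not PlaNet code) would be:

from dm_control import suite

# Load a task, reset it, and render one frame offscreen.
env = suite.load('cartpole', 'balance')
env.reset()
pixels = env.physics.render(height=64, width=64)
print(pixels.shape)  # expected: (64, 64, 3)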

lunar24 commented 5 years ago

Hi @danijar. As for my configuration, I installed mujoco-py after installing MuJoCo 150, then reinstalled MuJoCo 200 and installed dm_control. This is the only way I can think of; I don't know if it will cause any problems. In addition, when I run test_planet with PyCharm, I get AttributeError: 'PlanetTest' object has no attribute 'create_tempdir'. I haven't found a suitable solution online. I wonder if you know where the problem is. Thanks very much.

danijar commented 5 years ago

@lunar24 You can install multiple MuJoCo versions by placing them into ~/.mujoco/. You also don't need mujoco-py to run the code, as dm_control comes with its own bindings. To see if the code works, just run the command provided in the readme. If you do want to run the tests, you need to call them as, e.g., python3 -m scripts.test_planet.

astronautas commented 5 years ago

> @astronautas and @JamesLuoau To debug this further, could you please confirm that you can create a dm_control environment and call render on it (outside of the PlaNet code)? I have both mjpro150 and mjpro200_linux installed on my machine, but I think only the latter is used by dm_control. The PlaNet code is independent of the dm_control render option and should work with all of them as long as they support multi-threading -- I've used multiple options at some point.

I'll have some time this weekend for this. I'll post back the results.

lunar24 commented 5 years ago

@danijar Hi, when I was running the program, I encountered an error about the process. Here are some error hints:

[three screenshots of the error output; not reproduced here]

I have checked a lot of information about this error, but I have not solved it. On the other hand, I am concerned that changing the code that calls the process may cause other problems, so I hope to get your advice. Thank you very much for your help.

astronautas commented 5 years ago

@danijar I can confirm these things at the moment:

from threading import Thread

import cv2
import numpy as np
from dm_control import suite
from dm_control import viewer

def rewards(env):
  # Step through an episode with random actions, rendering each frame
  # and printing the reward.
  action_spec = env.action_spec()
  time_step = env.reset()

  while not time_step.last():
    action = np.random.uniform(action_spec.minimum,
                               action_spec.maximum,
                               size=action_spec.shape)

    time_step = env.step(action)
    img = env.physics.render()

    cv2.imshow("img", img)
    cv2.waitKey(0)  # blocks until a key press between frames

    print(time_step.reward)

  print("END")

# Load one task:
env = suite.load(domain_name="cartpole", task_name="swingup")

# Iterate over a task set (each iteration replaces env; the last one wins):
for domain_name, task_name in suite.BENCHMARKING:
  env = suite.load(domain_name, task_name)

# Thread.start() returns None, so keep the Thread object itself.
thread = Thread(target=lambda: rewards(env))
thread.start()

# viewer.launch(env)

Environment:

@danijar, could you please verify again that the code works both with the ExternalProcess wrapper and without it? I suspect that launching the environment in a separate process would alleviate the rendering problem, since, judging from the logs, the issue is that the render context is made current on multiple threads within the same process. However, neither I nor @JamesLuoau can successfully launch the environment in a separate process.

EDIT: correct me if I'm wrong, but here is how I understand the current implementation:

There are two processes communicating with each other: training process <-----> worker (environment).

Problem: It seems that when episodes are collected, the environment process never receives the last reset message (the one just before the close message). The message is sent and received when training starts, yet it is sent and never received when the epoch is about to end.

Maybe there is something incorrect in how the external methods on the environment get called? I am not sure whether that's the case, but could you verify that the environment process always writes to its end of the pipe while the reinforcement-learning process always writes to its own end?
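
To make the suspected architecture concrete, here is a minimal, hypothetical sketch of such a pipe-based worker (the names and message format are illustrative, not PlaNet's actual ExternalProcess implementation):

import multiprocessing

def _worker(conn, env_ctor):
  # The child process owns the environment (and its GL context) exclusively.
  env = env_ctor()
  while True:
    message, payload = conn.recv()
    if message == 'step':
      conn.send(env.step(payload))
    elif message == 'reset':
      conn.send(env.reset())
    elif message == 'close':
      conn.close()
      break

class ExternalProcessSketch(object):
  """Proxies step/reset calls to an environment living in another process."""

  def __init__(self, env_ctor):
    self._conn, child_conn = multiprocessing.Pipe()
    self._process = multiprocessing.Process(
        target=_worker, args=(child_conn, env_ctor))
    self._process.start()

  def step(self, action):
    self._conn.send(('step', action))
    return self._conn.recv()

  def reset(self):
    self._conn.send(('reset', None))
    return self._conn.recv()

  def close(self):
    self._conn.send(('close', None))
    self._process.join()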

danijar commented 5 years ago

@astronautas and @JamesLuoau Let's move this conversation over to #5 since the thread here got a bit confusing. I've responded to your questions there.

@lunar24 Thanks for reporting this. To keep the threads focused, I started a new ticket for your issue: #6. Please provide the details I asked for there so we can try to resolve this.