google-research / planet

Learning Latent Dynamics for Planning from Pixels
https://danijar.com/planet
Apache License 2.0

Extending for new (deterministic) environment error: 0-th value returned by pyfunc_6 is float, but expects uint8 #9

Closed. piojanu closed this issue 5 years ago.

piojanu commented 5 years ago

Hi again!

TL;DR: I know it's quite long; my specific question is at the end. Here is an introduction to what I'm trying to accomplish and where I am now.

I'm trying to add a new environment to your code: Sokoban. I know PlaNet was originally designed for continuous control tasks, but what I'm interested in is the planning network. I then want to use it to train a TD-Search algorithm on imagined episodes, with the latent state as input. My research investigates whether other planning algorithms (like TD-Search, AlphaZero, etc.) are able to (1) use a learned model and (2) use its abstract state representation as high-level features that make learning easier for them (something like what World Models proposed with its decomposition of representation learning into Vision and Memory; I tried their architecture (here is my implementation), but it didn't work for Sokoban, so I decided to drop it and try PlaNet).

I think I'm on the right track; you can see it here: a Sokoban wrapper that makes the action space "continuous" and resizes the observation. It doesn't have to work great, I just want to run your code with this environment.

import gym
import numpy as np
import skimage.transform


class SokobanWrapper(object):
    """Wraps a Sokoban environment into a continuous control task."""

    def __init__(self, env, size=(64, 64)):
        self._env = env
        self._size = size

    def __getattr__(self, name):
        return getattr(self._env, name)

    @property
    def observation_space(self):
        low = self._env.observation_space.low[0, 0, 0]
        high = self._env.observation_space.high[0, 0, 0]
        dtype = self._env.observation_space.dtype
        return gym.spaces.Box(low=low, high=high, shape=(*self._size, 3), dtype=dtype)

    @property
    def action_space(self):
        return gym.spaces.Box(low=0, high=1,
                              shape=(self._env.action_space.n,),
                              dtype=np.float32)

    def step(self, action):
        action = np.argmax(action)
        obs, reward, done, info = self._env.step(action)
        return self._preproc_obs(obs), reward, done, info

    def reset(self):
        obs = self._env.reset()
        return self._preproc_obs(obs)

    def render(self, *args, **kwargs):
        if kwargs.get('mode', 'rgb_array') != 'rgb_array':
            raise ValueError("Only render mode 'rgb_array' is supported.")
        del args  # Unused
        del kwargs  # Unused
        return self._env.render(mode='rgb_array')

    def _preproc_obs(self, obs):
        return skimage.transform.resize(
            obs, self._size, mode='edge', order=1, preserve_range=True)

Sokoban task and its factory function:

def sokoban(config, params):
    max_length = 120
    state_components = ['reward', 'image']
    env_ctor = _sokoban_env
    return Task('Sokoban', env_ctor, max_length, state_components)

def _sokoban_env():
    import gym_sokoban
    import gym

    def env_ctor():
        env = control.wrappers.SokobanWrapper(gym.make('Sokoban-v0'), (64, 64))
        env = control.wrappers.ObservationDict(env, 'image')
        env = control.wrappers.ConvertTo32Bit(env)
        return env
    env = control.wrappers.ExternalProcess(env_ctor)
    return env

I can run it with this command:

python3 -m planet.scripts.train \
--logdir logs \
--config debug \
--params '{tasks: [sokoban]}'

But it crashes in the first epoch (phase train) with this error:

Caused by op 'graph/collection/should_collect_Sokoban/simulate-1/train-Sokoban-cem-12/scan/while/simulate/cond_1/reset', defined at:
  File "/usr/local/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/piotr/Projects/Planning-in-Imagination/src/planet/scripts/train.py", line 130, in <module>
    tf.app.run(lambda _: main(args_), remaining)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/Users/piotr/Projects/Planning-in-Imagination/src/planet/scripts/train.py", line 130, in <lambda>
    tf.app.run(lambda _: main(args_), remaining)
  File "/Users/piotr/Projects/Planning-in-Imagination/src/planet/scripts/train.py", line 102, in main
    for unused_score in run:
  File "/Users/piotr/Projects/Planning-in-Imagination/src/planet/training/running.py", line 199, in __iter__
    for value in self._process_fn(self._logdir, *args):
  File "/Users/piotr/Projects/Planning-in-Imagination/src/planet/scripts/train.py", line 87, in process
    training.define_model, dataset, logdir, config):
  File "/Users/piotr/Projects/Planning-in-Imagination/src/planet/training/utility.py", line 160, in train
    score, summary = model_fn(data, trainer, config)
  File "/Users/piotr/Projects/Planning-in-Imagination/src/planet/training/define_model.py", line 133, in define_model
    name='should_collect_' + params.task.name)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2086, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1930, in BuildCondBranch
    original_result = fn()
  File "/Users/piotr/Projects/Planning-in-Imagination/src/planet/training/utility.py", line 254, in simulate_episodes
    1, agent_config, name=name)
  File "/Users/piotr/Projects/Planning-in-Imagination/src/planet/control/simulate.py", line 42, in simulate
    env_processes=env_processes)
  File "/Users/piotr/Projects/Planning-in-Imagination/src/planet/control/simulate.py", line 78, in collect_rollouts
    initializer, parallel_iterations=1)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/functional_ops.py", line 718, in scan
    maximum_iterations=n)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3291, in while_loop
    return_same_structure)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3004, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2939, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3260, in <lambda>
    body = lambda i, lv: (i + 1, orig_body(*lv))
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/functional_ops.py", line 697, in compute
    a_out = fn(packed_a, packed_elems)
  File "/Users/piotr/Projects/Planning-in-Imagination/src/planet/control/simulate.py", line 63, in simulate_fn
    reset=tf.equal(step, 0))
  File "/Users/piotr/Projects/Planning-in-Imagination/src/planet/control/simulate.py", line 217, in simulate_step
    lambda: (str(), score_var, length_var))
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2086, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1930, in BuildCondBranch
    original_result = fn()
  File "/Users/piotr/Projects/Planning-in-Imagination/src/planet/control/simulate.py", line 216, in <lambda>
    lambda: _define_begin_episode(agent_indices),
  File "/Users/piotr/Projects/Planning-in-Imagination/src/planet/control/simulate.py", line 132, in _define_begin_episode
    batch_env.reset(agent_indices), update_score, update_length]
  File "/Users/piotr/Projects/Planning-in-Imagination/src/planet/control/in_graph_batch_env.py", line 116, in reset
    self._batch_env.reset, [indices], observ_dtype, name='reset')
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 457, in py_func
    func=func, inp=inp, Tout=Tout, stateful=stateful, eager=False, name=name)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 281, in _internal_py_func
    input=inp, token=token, Tout=Tout, name=name)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_script_ops.py", line 129, in py_func
    "PyFunc", input=input, token=token, Tout=Tout, name=name)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): 0-th value returned by pyfunc_6 is float, but expects uint8
     [[node graph/collection/should_collect_Sokoban/simulate-1/train-Sokoban-cem-12/scan/while/simulate/cond_1/reset (defined at /Users/piotr/Projects/Planning-in-Imagination/src/planet/control/in_graph_batch_env.py:116)  = PyFunc[Tin=[DT_INT32], Tout=[DT_UINT8], token="pyfunc_6", _device="/job:localhost/replica:0/task:0/device:CPU:0"](graph/collection/should_collect_Sokoban/simulate-1/train-Sokoban-cem-12/scan/while/simulate/cond_1/zeros_like/Shape/Switch:1)]]

Could you give me any ideas on where I should start debugging? I have to say that I don't really get your code. I don't understand all of those experiments and runs, and even the model definition looks scary ;) I haven't worked with such a big Python codebase before.

One specific question I have right now, though: I don't understand what those state_components are used for (I know they are some components of the observation that dm_control returns) and what I should put there in the Sokoban factory function. I see that there are heads created for each state component in define_model.py:55:

    # Instantiate network blocks.
    cell = config.cell()
    kwargs = dict()
    encoder = tf.make_template(
        'encoder', config.encoder, create_scope_now_=True, **kwargs)
    heads = {}
    for key, head in config.heads.items():
        name = 'head_{}'.format(key)
        kwargs = dict(data_shape=obs[key].shape[2:].as_list())
        heads[key] = tf.make_template(name, head, create_scope_now_=True, **kwargs)

I don't get it. There is nothing about state components in the paper. PlaNet is supposed to work on images, so what are those heads for, then?

Greetings, Piotr

danijar commented 5 years ago

This sounds like an interesting project! Maybe the render function of your environment is returning a np.float32 array with elements in [0, 1] while it should be returning a np.uint8 array with elements in [0, 255]?

Only the image decoder and the reward state component contribute to training the model. You do not need any other state components. However, if you have true information about the environment state available, adding these could help with interpreting what the agent learns. The PlaNet code will try to predict these additional state components from a copy of the learned latent space (without contributing gradients back to the rest of the model) and add scalar and image summaries about the predictions to TensorBoard.
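For example, if your environment exposed ground-truth state through the observation dict, the task definition could look roughly like this (just an illustrative sketch; 'box_positions' is a hypothetical key that your wrapper would have to provide):

    state_components = ['reward', 'image', 'box_positions']  # 'box_positions' is hypothetical
    return Task('Sokoban', env_ctor, max_length, state_components)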

piojanu commented 5 years ago

Indeed, my _preproc_obs method returned floats whereas it should return uint8, and my action space should be from -1 to 1, not from 0 to 1 ;) Thanks for the hint, it runs now! I'm closing this issue, but I have one more unrelated question. Before I go further, I would like to understand your code better (I'll probably try to extend it with other planners like I said; I don't want to implement it myself from scratch). Do you have any resources that describe this code's architecture? Some documentation? It's hard to figure out what is going on from the raw code alone 😮
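For reference, the fix looks roughly like this (just a sketch, not the exact code I committed):

    @property
    def action_space(self):
        # Present the discrete Sokoban actions as a fake continuous space in [-1, 1].
        return gym.spaces.Box(low=-1, high=1,
                              shape=(self._env.action_space.n,),
                              dtype=np.float32)

    def _preproc_obs(self, obs):
        # skimage.transform.resize returns floats, so cast back to uint8 in [0, 255].
        resized = skimage.transform.resize(
            obs, self._size, mode='edge', order=1, preserve_range=True)
        return resized.astype(np.uint8)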

EDIT: @danijar, to be more specific: I'm most confused by those Experiments and Runs in train.py. Could you elaborate on what gets executed in parallel (I've already seen that environments are batched, but what else? What are those experiments and runs that workers execute?) and on how the file-based messaging interface works?

danijar commented 5 years ago

Great to hear. You don't need to worry about the Experiment and Run classes in training/running.py. They are just for running multiple (independent) experiments on a cluster with a limited number of machines. This can be useful for hyperparameter search but has nothing to do with parallel training. Please see https://github.com/google-research/planet/issues/3#issuecomment-471316616 for how to ignore this code.

piojanu commented 5 years ago

Okay, now I understand it better! Thank you :) So what is indeed parallelized within one experiment is the environment execution (those are batched), and what else? There are multiple processes; is the parameter training distributed too, or does it run in one process while the other processes are only used for data gathering? Are test runs executed in parallel too, or in sequence? Please point me to the place in the code with a hint of how it works, and I should catch up a lot faster than on my own :)

danijar commented 5 years ago

There is no parallelization besides TensorFlow's thread pool. The data collection can be parallelized but I'm not using this so far and it's not a tested feature.