uber-research / deep-neuroevolution

Deep Neuroevolution

[es_distributed/tf_util.py] ValueError: Dimension 0 in both shapes must be equal, but are 4608 and 18. Shapes are [4608] and [18]. From merging shape 12 with other shapes. for 'concat/concat_dim' (op: 'Pack') with input shapes: [4096], [16], [16], [16], [8192], [32], [32], [32], [991232], [256], [256], [256], [4608], [18]. #31

dragon28 opened this issue 5 years ago

dragon28 commented 5 years ago

Hello People,

I ran into an error while testing the ES algorithm.

python3 -m es_distributed.main master --master_socket_path /tmp/es_redis_master.sock --algo es --exp_file configurations/frostbite_es.json
[2019-06-04 23:15:37,056 pid=22170] run_master: {'exp': {'config': {'calc_obstat_prob': 0.0, 'episodes_per_batch': 5000, 'eval_prob': 0.01, 'l2coeff': 0.005, 'noise_stdev': 0.005, 'snapshot_freq': 20, 'timesteps_per_batch': 10000, 'return_proc_mode': 'centered_rank', 'episode_cutoff_mode': 5000}, 'env_id': 'FrostbiteNoFrameskip-v4', 'optimizer': {'args': {'stepsize': 0.01}, 'type': 'adam'}, 'policy': {'args': {}, 'type': 'ESAtariPolicy'}}, 'log_dir': '/tmp/es_master_22170', 'master_redis_cfg': {'unix_socket_path': '/tmp/es_redis_master.sock'}}
[2019-06-04 23:15:38,083 pid=22170] Tabular logging to /tmp/es_master_22170
2019-06-04 23:15:38.894940: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3410300000 Hz
2019-06-04 23:15:38.895374: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x40bfa50 executing computations on platform Host. Devices:
2019-06-04 23:15:38.895393: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
[2019-06-04 23:15:38,904 pid=22170] From /home/dragon/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
[2019-06-04 23:15:38,991 pid=22170] From /home/dragon/.local/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py:1624: flatten (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.flatten instead.
[2019-06-04 23:15:39,054 pid=22170] From /home/dragon/.local/lib/python3.6/site-packages/tensorflow/python/util/decorator_utils.py:145: GraphKeys.VARIABLES (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.GraphKeys.GLOBAL_VARIABLES` instead.
Traceback (most recent call last):
  File "/home/dragon/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1659, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 4608 and 18. Shapes are [4608] and [18].
        From merging shape 12 with other shapes. for 'concat/concat_dim' (op: 'Pack') with input shapes: [4096], [16], [16], [16], [8192], [32], [32], [32], [991232], [256], [256], [256], [4608], [18].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/dragon/machine_learning/deep-neuroevolution/es_distributed/main.py", line 90, in <module>
    cli()
  File "/home/dragon/.local/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/dragon/.local/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/dragon/.local/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/dragon/.local/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/dragon/.local/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/dragon/machine_learning/deep-neuroevolution/es_distributed/main.py", line 61, in master
    algo.run_master({'unix_socket_path': master_socket_path}, log_dir, exp)
  File "/home/dragon/machine_learning/deep-neuroevolution/es_distributed/es.py", line 147, in run_master
    config, env, sess, policy = setup(exp, single_threaded=False)
  File "/home/dragon/machine_learning/deep-neuroevolution/es_distributed/es.py", line 136, in setup
    policy = getattr(policies, exp['policy']['type'])(env.observation_space, env.action_space, **exp['policy']['args'])
  File "/home/dragon/machine_learning/deep-neuroevolution/es_distributed/policies.py", line 24, in __init__
    self._getflat = U.GetFlat(self.trainable_variables)
  File "/home/dragon/machine_learning/deep-neuroevolution/es_distributed/tf_util.py", line 244, in __init__ 
    self.op = tf.concat(0, [tf.reshape(v, [numel(v)]) for v in var_list])                                     
  File "/home/dragon/.local/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/dragon/.local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1253, in concat
    dtype=dtypes.int32).get_shape().assert_is_compatible_with(
  File "/home/dragon/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1039, in convert_to_tensor
    return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
  File "/home/dragon/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1097, in convert_to_tensor_v2
    as_ref=False)
  File "/home/dragon/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1175, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/dragon/.local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1102, in _autopacking_conversion_function
    return _autopacking_helper(v, dtype, name or "packed")
  File "/home/dragon/.local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1054, in _autopacking_helper
    return gen_array_ops.pack(elems_as_tensors, name=scope)
  File "/home/dragon/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5448, in pack
    "Pack", values=values, axis=axis, name=name)
  File "/home/dragon/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/dragon/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/dragon/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/dragon/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1823, in __init__
    control_input_ops)
  File "/home/dragon/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1662, in _create_c_op
    raise ValueError(str(e))
ValueError: Dimension 0 in both shapes must be equal, but are 4608 and 18. Shapes are [4608] and [18].
From merging shape 12 with other shapes. for 'concat/concat_dim' (op: 'Pack') with input shapes: [4096], [16], [16], [16], [8192], [32], [32], [32], [991232], [256], [256], [256], [4608], [18].

The error traces back to the es_distributed/tf_util.py file, specifically to calls of tf.concat: TensorFlow 1.0 swapped the argument order from tf.concat(axis, values) to tf.concat(values, axis), so the old calls pass the axis where the tensor list is expected.

Below are the changes I made:

1) def concatenate(arrs, axis=0) function at line 30 - 31

from:

def concatenate(arrs, axis=0):
    return tf.concat(axis, arrs)

to:

def concatenate(arrs, axis=0):
    return tf.concat(arrs, axis)

2) def flatgrad(loss, var_list) function at lines 219 - 222

from:

def flatgrad(loss, var_list):
    grads = tf.gradients(loss, var_list)
    return tf.concat(0, [tf.reshape(grad, [numel(v)])
        for (v, grad) in zip(var_list, grads)])

to:

def flatgrad(loss, var_list):
    grads = tf.gradients(loss, var_list)
    return tf.concat([tf.reshape(grad, [numel(v)])
        for (v, grad) in zip(var_list, grads)], 0)

3) def __init__(self, var_list) function at line 243 -244

from:

def __init__(self, var_list):
        self.op = tf.concat(0, [tf.reshape(v, [numel(v)]) for v in var_list])

to:

def __init__(self, var_list):
        self.op = tf.concat([tf.reshape(v, [numel(v)]) for v in var_list], 0)
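All three fixes perform the same operation: flatten every variable into a 1-D tensor and concatenate them along axis 0 into a single parameter vector. As a sanity check, here is a minimal NumPy analogue of what GetFlat computes; the variable shapes are a hypothetical subset picked to match sizes from the traceback (4096, 16, 4608, 18), not the full Atari policy:

```python
import numpy as np

# Hypothetical variable shapes (conv kernel, bias, final dense layer, bias)
# chosen so their flattened sizes match entries in the traceback above.
shapes = [(4, 4, 16, 16), (16,), (256, 18), (18,)]
var_list = [np.zeros(s) for s in shapes]

# NumPy analogue of GetFlat: flatten each variable, then concatenate
# the pieces along axis 0 into one parameter vector.
flat = np.concatenate([v.reshape(v.size) for v in var_list], axis=0)
print(flat.shape)  # (8738,) = 4096 + 16 + 4608 + 18
```

With the old argument order, the axis 0 gets packed together with the flattened tensors, which is exactly why the error message shows a Pack op trying to merge shapes like [4608] and [18].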

My environment information: Ubuntu 18.04 x64, Python 3.6.8, tensorflow 1.13.1, Click 7.0, atari-py 0.1.15, numpy 1.16.3, gym 0.12.1, baselines 0.1.5

Thanks

EmanueleLM commented 5 years ago

This error depends on the version of TensorFlow you use. With the required one (i.e. 0.12.1) it works without changing the codebase. I can also confirm that with the latest versions of TensorFlow your solution fixes the issue.
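If you want a single codebase that runs under both the old and the new argument order, one option is to branch on the installed version string. The helper below is a hypothetical sketch (the name `concat_args` is mine, not from the repo); only the version comparison is shown, and the actual call would be `tf.concat(*concat_args(tf.__version__, tensors, 0))`:

```python
# Hypothetical helper: order the tf.concat arguments for the installed
# TensorFlow version (tf.concat(axis, values) before 1.0,
# tf.concat(values, axis) from 1.0 onward).
def concat_args(tf_version, values, axis):
    major, minor = (int(p) for p in tf_version.split('.')[:2])
    if (major, minor) >= (1, 0):
        return (values, axis)   # TF >= 1.0
    return (axis, values)       # TF < 1.0

print(concat_args('1.13.1', ['a', 'b'], 0))  # (['a', 'b'], 0)
print(concat_args('0.12.1', ['a', 'b'], 0))  # (0, ['a', 'b'])
```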