tensorflow / agents

TF-Agents: A reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning.

Epsilon greedy policy error when generating new action from Dict Action space. #276

Open JaCoderX opened 4 years ago

JaCoderX commented 4 years ago

I'm trying to convert a custom gym project (called BTgym) to work as a TF-Agents environment. The original observation space and action space are both gym.spaces.Dict, but for the moment I have simplified the observation space so the env can run with the same code as the DQN tutorial example (as a proof of concept). The modified spaces are as follows:

Observation Spec:
BoundedTensorSpec(shape=(6, 1, 5), dtype=tf.float32, name='observation/external', minimum=array(-100., dtype=float32), maximum=array(100., dtype=float32))
Action Spec:
OrderedDict([('default_asset', BoundedTensorSpec(shape=(), dtype=tf.int64, name='action/default_asset', minimum=array(0), maximum=array(3)))]) 
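
For reference, specs like these could be constructed directly in TF-Agents roughly as below (a minimal sketch mirroring the printout above; the exact construction in my env differs, this is just for illustration):

```python
import tensorflow as tf
from tf_agents.specs import tensor_spec

# Flattened observation: a single bounded float tensor.
observation_spec = tensor_spec.BoundedTensorSpec(
    shape=(6, 1, 5), dtype=tf.float32, name='observation/external',
    minimum=-100.0, maximum=100.0)

# Dict action space with a single discrete entry, values in [0, 3].
action_spec = {
    'default_asset': tensor_spec.BoundedTensorSpec(
        shape=(), dtype=tf.int64, name='action/default_asset',
        minimum=0, maximum=3),
}
```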

The error occurs in the "Training the agent" section, when performing collect_step():

Traceback (most recent call last):
  File "/home/jack/envTest.py", line 251, in <module>
    collect_step(train_env, agent.collect_policy, replay_buffer)
  File "/home/jack/envTest.py", line 209, in collect_step
    action_step = policy.action(time_step)
  File "/home/jack/tf_agents/policies/tf_policy.py", line 278, in action
    step = action_fn(time_step=time_step, policy_state=policy_state, seed=seed)
  File "/home/jack/tf_agents/utils/common.py", line 131, in with_check_resource_vars
    return fn(*fn_args, **fn_kwargs)
  File "/home/jack/tf_agents/policies/epsilon_greedy_policy.py", line 106, in _action
    random_action.action)
  File "/home/jack/anaconda3/envs/deep/lib/python3.6/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/jack/anaconda3/envs/deep/lib/python3.6/site-packages/tensorflow_core/python/ops/array_ops.py", line 3753, in where
    return gen_math_ops.select(condition=condition, x=x, y=y, name=name)
  File "/home/jack/anaconda3/envs/deep/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 9430, in select
    condition, x, y, name=name, ctx=_ctx)
  File "/home/jack/anaconda3/envs/deep/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 9462, in select_eager_fallback
    _attr_T, _inputs_T = _execute.args_to_matching_eager([x, y], _ctx)
  File "/home/jack/anaconda3/envs/deep/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 257, in args_to_matching_eager
    t, dtype, preferred_dtype=default_dtype, ctx=ctx))
  File "/home/jack/anaconda3/envs/deep/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1296, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/jack/anaconda3/envs/deep/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 286, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/home/jack/anaconda3/envs/deep/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 227, in constant
    allow_broadcast=True)
  File "/home/jack/anaconda3/envs/deep/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 235, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/home/jack/anaconda3/envs/deep/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 96, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
ValueError: Attempt to convert a value (DictWrapper({'default_asset': <tf.Tensor: id=132834, shape=(1,), dtype=int64, numpy=array([0])>})) with an unsupported type (<class 'tensorflow.python.training.tracking.data_structures._DictWrapper'>) to a Tensor.

It seems that the epsilon greedy policy has a problem with Dict action spaces when generating an action via action_step = policy.action(time_step). Both the DQN and random agents produce actions fine on their own, so Dict action spaces appear to be unsupported only in the epsilon greedy policy.
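
A minimal standalone sketch of the underlying failure (the dummy tensors here are illustrative, mirroring the action spec above): tf.where expects tensor inputs, so handing it a dict forces a tensor conversion of the whole dict, which fails.

```python
import tensorflow as tf

cond = tf.constant([True])
greedy_action = {'default_asset': tf.constant([1], dtype=tf.int64)}
random_action = {'default_asset': tf.constant([0], dtype=tf.int64)}

try:
    # Converting the whole dict to a single tensor is what raises the
    # "Attempt to convert a value ... with an unsupported type" error.
    tf.compat.v1.where(cond, greedy_action, random_action)
except (ValueError, TypeError) as e:
    print(e)
```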

Any idea on how to resolve this?

JaCoderX commented 4 years ago

I'm still not sure how to add support for Dict action spaces.

The error occurs when trying to select the action from greedy_action (greedy_policy is the source): action = tf.compat.v1.where(cond, greedy_action.action, random_action.action)

ebrevdo commented 4 years ago

That line needs to be rewritten as:

action = tf.nest.map_structure(lambda g, r: tf.compat.v1.where(cond, g, r), greedy_action.action, random_action.action)
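
For illustration, a self-contained sketch of why this works (the dummy tensors stand in for the two policies' outputs): tf.nest.map_structure applies the where op to each leaf tensor of the (possibly nested) action structure, so dict-valued actions are selected leaf by leaf instead of being converted to a single tensor.

```python
import tensorflow as tf

cond = tf.constant([True])  # e.g. the "take the greedy action" mask
greedy_action = {'default_asset': tf.constant([1], dtype=tf.int64)}
random_action = {'default_asset': tf.constant([0], dtype=tf.int64)}

# map_structure walks both dicts in parallel and applies the selection
# element-wise to each matching pair of leaf tensors.
action = tf.nest.map_structure(
    lambda g, r: tf.compat.v1.where(cond, g, r),
    greedy_action, random_action)
print(action)  # {'default_asset': <tf.Tensor ... numpy=array([1])>}
```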

Report back and let us know if this works. We can patch it on our end.

JaCoderX commented 4 years ago

@ebrevdo, I tested it on my end and it works well. Thank you :)