tensorflow / agents

TF-Agents: A reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning.
Apache License 2.0

Error using a trained PPO policy #482

Open gbuonamico opened 4 years ago

gbuonamico commented 4 years ago

Hello, I'm trying to use a PPO tf-agent with a trained policy, but I get the following error:


```
ValueError                                Traceback (most recent call last)
in <module>()
      1
----> 2 evaluate(environment_eval, eval_env, eval_policy, num_episodes=3)

in evaluate(py_environment, tf_environment, policy, num_episodes)
     41
     42         while not time_step.is_last():
---> 43             action, policy_state, _ = policy.action(time_step, policy_state)
     44             time_step = tf_environment.step(action)
     45             print(py_environment.render())

/Users/luca/venv/lib/python3.7/site-packages/tf_agents/policies/tf_policy.py in action(self, time_step, policy_state, seed)
    276
    277     if self._automatic_state_reset:
--> 278       policy_state = self._maybe_reset_state(time_step, policy_state)
    279     step = action_fn(time_step=time_step, policy_state=policy_state, seed=seed)
    280

/Users/luca/venv/lib/python3.7/site-packages/tf_agents/policies/tf_policy.py in _maybe_reset_state(self, time_step, policy_state)
    241     # time_step in the sequence as we can't easily generalize how the policy is
    242     # unrolled over the sequence.
--> 243     if nest_utils.get_outer_rank(time_step, self._time_step_spec) > 1:
    244       condition = time_step.is_first()[:, 0, ...]
    245       return nest_utils.where(condition, zero_state, policy_state)

/Users/luca/venv/lib/python3.7/site-packages/tf_agents/utils/nest_utils.py in get_outer_rank(tensors, specs)
    524         'Saw tensor_shapes:\n %s\n'
    525         'And spec_shapes:\n %s' %
--> 526         (num_outer_dims, tensor_shapes, spec_shapes))
    527
    528

ValueError: Received a mix of batched and unbatched Tensors, or Tensors are not compatible with Specs. num_outer_dims: 1.
Saw tensor_shapes: [TensorShape([1]), TensorShape([1]), TensorShape([1]), TensorShape([1, 960, 18])]
And spec_shapes: [TensorShape([]), TensorShape([]), TensorShape([]), TensorShape([1, 960, 18])]
```

_**Here are my agent and network definitions:**_

```python
def create_networks(tf_env, conv_layer_params):
    actor_net = ActorDistributionRnnNetwork(
        tf_env.observation_spec(),
        tf_env.action_spec(),
        conv_layer_params=None,
        input_fc_layer_params=(200, 100),
        lstm_size=(200, 100),
        output_fc_layer_params=None)

    value_net = ValueRnnNetwork(
        tf_env.observation_spec(),
        conv_layer_params=None,
        input_fc_layer_params=(200, 100),
        lstm_size=(200, 100),
        output_fc_layer_params=None,
        activation_fn=tf.nn.elu)

    return actor_net, value_net


actor_net, value_net = create_networks(tf_env, conv_layer_params)

agent = ppo_agent.PPOAgent(
    tf_env.time_step_spec(),
    tf_env.action_spec(),
    optimizer,
    actor_net=actor_net,
    value_net=value_net,
    num_epochs=num_epochs,
    gradient_clipping=0.2,
    entropy_regularization=1e-2,
    importance_ratio_clipping=0.2,
    use_gae=True,
    use_td_lambda_return=True)
agent.initialize()

eval_policy = agent.policy

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    agent.collect_data_spec,
    batch_size=tf_env.batch_size,
    max_length=replay_buffer_capacity)

policy_checkpointer = common.Checkpointer(
    ckpt_dir=policy_dir,
    policy=eval_policy,
    global_step=global_step)
policy_checkpointer.initialize_or_restore()
```

_**and this is the function I use to get action values (it raises the error above when I call it):**_

```python
def evaluate(py_environment: PyEnvironment,
             tf_environment: TFEnvironment,
             policy: tf_policy.Base,
             num_episodes=10):
    for episode in range(num_episodes):
        logging.info("Generating episode %d of %d" % (episode, num_episodes))

        state = policy.get_initial_state(tf_environment.batch_size)
        time_step = tf_environment.reset()
        policy_state = policy.get_initial_state(tf_environment.batch_size)
        while not time_step.is_last():
            action, policy_state, _ = policy.action(time_step, policy_state)
            time_step = tf_environment.step(action)
            print(py_environment.render())


evaluate(environment_eval, eval_env, eval_policy, num_episodes=3)
```

_**Any idea? Thank you.**_
summer-yue commented 4 years ago

Should your observation spec in your environment be TensorShape([960, 18]) instead of TensorShape([1, 960, 18])?

gbuonamico commented 4 years ago

Hello, thank you for replying. No, because using an LSTM in the actor and value networks requires that additional dimension. Training works fine. This is the portion of code I use for training:


```python
def train_step():
    trajectories = replay_buffer.gather_all()
    return tf_agent.train(experience=trajectories)


collect_time = 0
train_time = 0
time_step = None
timed_at_step = global_step.numpy()

while environment_steps_metric.result() < num_environment_steps:
    current_metrics = []

    # Collect experience.
    start_time = time.time()
    collect_driver.run()
    collect_time += time.time() - start_time

    # Train on the collected trajectories, then clear the buffer.
    start_time = time.time()
    total_loss, _ = train_step()
    replay_buffer.clear()
    train_time += time.time() - start_time
```

gbuonamico commented 4 years ago

Hello, any suggestion would be appreciated.

summer-yue commented 4 years ago

Sorry about the delay. Let me take a closer look at this this afternoon.

gbuonamico commented 4 years ago

Not a problem. I think the problem comes from the use of the wrapper `train_step = common.function(train_step)` in the training phase. Please keep this in mind tomorrow and let me know.
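For reference, this is the kind of wrapping I'm referring to; the exact placement in my script is paraphrased here:

```python
from tf_agents.utils import common

# Wrap the agent's train function and my train_step in a tf.function for speed.
tf_agent.train = common.function(tf_agent.train)
train_step = common.function(train_step)
```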

summer-yue commented 4 years ago

The ValueError you're seeing says that the time_step you pass into policy.action is not aligned with the spec the policy expects. Could you try not including the additional dimension in your observation spec, even though you're using an LSTM? I think the agent handles that by checking the network later on.

I also tried running your code on CartPole and it finished successfully, so the issue seems to be with the environment. (A rough reconstruction of that check is sketched below.)
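This is not the exact script I ran, just a minimal sketch of the same shape of experiment, assuming standard TF-Agents imports:

```python
import tensorflow as tf
from tf_agents.agents.ppo import ppo_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks.actor_distribution_rnn_network import ActorDistributionRnnNetwork
from tf_agents.networks.value_rnn_network import ValueRnnNetwork

# CartPole has an unbatched observation spec of shape (4,); the TFPyEnvironment
# wrapper adds the batch dimension at runtime.
tf_env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))

actor_net = ActorDistributionRnnNetwork(
    tf_env.observation_spec(), tf_env.action_spec(),
    input_fc_layer_params=(64,), lstm_size=(64,), output_fc_layer_params=None)
value_net = ValueRnnNetwork(
    tf_env.observation_spec(),
    input_fc_layer_params=(64,), lstm_size=(64,), output_fc_layer_params=None)

agent = ppo_agent.PPOAgent(
    tf_env.time_step_spec(), tf_env.action_spec(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    actor_net=actor_net, value_net=value_net, num_epochs=1)
agent.initialize()

# The same kind of evaluation loop as in your snippet runs without the ValueError.
policy = agent.policy
time_step = tf_env.reset()
policy_state = policy.get_initial_state(tf_env.batch_size)
while not time_step.is_last():
    policy_step = policy.action(time_step, policy_state)
    policy_state = policy_step.state
    time_step = tf_env.step(policy_step.action)
```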

Adding @oars, who's more knowledgeable than me on this front, to confirm.

gbuonamico commented 4 years ago

I agree that time_step and policy_state are not aligned: time_step is batched, while policy_state (the initial state) is not.

If I do not include the additional dimension in the observation spec, I get the following error while trying to load the checkpoint (which is what I expected): "ValueError: Shapes (960, 18) and (1, 960, 18) are incompatible".

If you look at the error, both tensors have the right shape for the observation (1, 960, 18), but they differ in that time_step is batched (shape [1] in the three fields before the observation) and policy_state is not (shape [] in the same fields):


```
ValueError: Received a mix of batched and unbatched Tensors, or Tensors are not compatible with Specs. num_outer_dims: 1.
Saw tensor_shapes: [TensorShape([1]), TensorShape([1]), TensorShape([1]), TensorShape([1, 960, 18])]
And spec_shapes: [TensorShape([]), TensorShape([]), TensorShape([]), TensorShape([1, 960, 18])]
```


My question is: how can I add this batch dimension to the policy_state observation?

Additional note: in the training phase (which works fine), I get the same error while loading the checkpoint if I do not use common.function for train_step and agent.train.

summer-yue commented 4 years ago

Could you remove the extra dimension in your spec (not in your observation), such that spec_shapes is [960, 18] and your observation is still [1, 960, 18]?

gbuonamico commented 4 years ago

Sorry, but I don't understand what you mean. In my environment, the definitions of action_spec and observation_spec are the following:


```python
self._action_spec = array_spec.BoundedArraySpec(
    shape=(), dtype=np.int32, minimum=0, maximum=2, name='action')

ns = (1, self.shape[0], self.shape[1])
self._observation_spec = array_spec.BoundedArraySpec(
    shape=ns, dtype=np.float32, name='observation')
```


where self.shape[0] and self.shape[1] are dimensions given as input (960, 18).

These are the only "spec" definitions I have in my environment.

Would you mind being a little more specific, please?

summer-yue commented 4 years ago

Sure. I was suggesting to modify your observation spec to:

```python
# Note that we are removing the extra 1 at the front here.
ns = (self.shape[0], self.shape[1])
self._observation_spec = array_spec.BoundedArraySpec(
    shape=ns, dtype=np.float32, name='observation')
```

And keep the actual observation data as what you had before.

gbuonamico commented 4 years ago

That's what I did, as you suggested, and that's where I got the error I mentioned in my previous comment:

"ValueError: Shapes (960, 18) and (1, 960, 18) are incompatible"

summer-yue commented 4 years ago

Are you able to provide the code for your environment? Better if it's not too complicated. Since I cannot reproduce the issue you're seeing in standard environments, it's a bit hard to debug from my end. Thanks!

gbuonamico commented 4 years ago

Well, that's not possible, as the environment needs a database and additional procedures to run. But it's a standard Python environment wrapped into a TFEnvironment, with no changes to the action_spec and observation_spec you have seen before. Just for my understanding (then I will stop bothering you): the error message points to the difference in the dimensions of (in bold) Saw tensor_shapes: [**TensorShape([1]), TensorShape([1]), TensorShape([1])**, TensorShape([1, 960, 18])] And spec_shapes: [**TensorShape([]), TensorShape([]), TensorShape([])**, TensorShape([1, 960, 18])],

while the shapes of the observation itself are fine in both tensor_shapes and spec_shapes ([1, 960, 18] in both). To me, this seems unrelated to the trained agent; maybe it's in the policy saver or a wrapper used (like common.function), but my knowledge of these functions is quite limited. Again, thank you for your time.

summer-yue commented 4 years ago

Sorry that the previous suggestions weren't as helpful as I wished. I think I might understand where the confusion is. Let me try again.

The error message says that the received tensor shapes and the spec shapes are "not compatible", though the word compatible isn't very well defined. If you look closer at the code where it errs out, in nest_utils.is_batched_nested_tensors, you will notice that tensor_shapes and spec_shapes are not required to be exactly the same. Both cases below are considered compatible:

  1. tensor_shapes and spec_shapes are completely aligned, both unbatched.
  2. tensor_shapes has one or more extra outer dimensions than spec_shapes in every field. For example, tensor_shapes is [TensorShape([1]), TensorShape([1]), TensorShape([1]), TensorShape([1, 960, 18])] but spec_shapes is [TensorShape([]), TensorShape([]), TensorShape([]), TensorShape([960, 18])] - note that each tensor_shape has one extra batch dimension. This is compatible. Hence earlier I suggested reducing spec_shapes.observation from [1, 960, 18] to [960, 18] to make it compatible with your batched observation (batch_num=1) used in the eval code (see the sketch after this list).
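A minimal sketch of that shape check, using nest_utils.get_outer_rank (the function in your traceback); this is my own illustration with the shapes from your error, not code from this thread:

```python
import tensorflow as tf
from tf_agents.utils import nest_utils

# Specs describe a single (unbatched) time step: scalar fields plus a [960, 18] observation.
specs = (tf.TensorSpec([], tf.int32),           # e.g. step_type
         tf.TensorSpec([960, 18], tf.float32))  # observation without the leading 1

# Tensors coming out of a TFEnvironment with batch_size=1 have one extra outer dimension.
tensors = (tf.zeros([1], dtype=tf.int32),
           tf.zeros([1, 960, 18]))

# Compatible (case 2 above): every field has exactly one extra outer dimension.
print(nest_utils.get_outer_rank(tensors, specs))  # -> 1

# Incompatible: if the spec already contains the 1, the observation matches its
# spec exactly (0 outer dims) while the scalar fields have 1 outer dim, which
# triggers the "mix of batched and unbatched Tensors" ValueError from this issue.
bad_specs = (tf.TensorSpec([], tf.int32),
             tf.TensorSpec([1, 960, 18], tf.float32))
try:
    nest_utils.get_outer_rank(tensors, bad_specs)
except ValueError as e:
    print(e)
```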

I might see why you think it's a policy saver or wrapper issue. It's possible, or maybe I didn't understand your issue very well. To clarify: in your code, before you save and reload the policy, if you just call evaluate on agent.policy right after training, do you see the same issue? If not, that would point to a bug in PolicySaver. I think it is extremely unlikely that the common.function wrapper would change the spec dimensions.
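In other words, something along these lines, using the variable names from your earlier snippets (which are my assumption about how your script is laid out):

```python
# Right after training finishes (before any checkpoint restore), run the same
# evaluation on the in-memory policy:
evaluate(environment_eval, eval_env, agent.policy, num_episodes=1)

# If this works but the policy restored via policy_checkpointer.initialize_or_restore()
# fails with the ValueError, that points at the saving/restoring path rather
# than at the policy itself.
```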

gbuonamico commented 3 years ago

Thank you for your answer. In the end I'm using this workaround (the re-wrapping lines are marked with comments in the code below; not sure it's great, but it seems to work):

```python
t_step = tf_environment.reset()
# Workaround: rebuild a batched TimeStep from the raw observation.
t_step = tf.expand_dims(t_step.observation, axis=0)
time_step = ts.restart(t_step, tf_environment.batch_size)

state = policy.get_initial_state(tf_environment.batch_size)
i = 0
while not time_step.is_last():
    policy_step: PolicyStep = policy.action(time_step, state)
    state = policy_step.state
    time_step = tf_environment.step(policy_step.action)
    # Workaround: re-wrap the new observation into a batched TimeStep.
    time_step = tf.expand_dims(time_step.observation, axis=0)
    time_step = ts.restart(time_step, batch_size=tf_environment.batch_size)
    if i % 500 == 0:
        print(py_environment.render(), 'Run:', i, 'Action', policy_step.action.numpy())
    i += 1
```

gbuonamico commented 3 years ago

But I remain frustrated at not really understanding the root problem. Thank you for your patience.