tensorflow / agents

TF-Agents: A reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning.

Actor network predicts actions over bounds using PPOClipAgent #847

Open b-fg opened 1 year ago

b-fg commented 1 year ago

I have two environments, one for collecting trajectories and one for evaluating them. The action_tensor_spec is the following:

BoundedTensorSpec(shape=(5,), dtype=tf.float32, name='action', minimum=array(-0.2, dtype=float32), maximum=array(0.2, dtype=float32))

The agent is a PPOClipAgent object defined as

from tf_agents.agents.ppo import ppo_clip_agent

agent = ppo_clip_agent.PPOClipAgent(
    time_step_tensor_spec,
    action_tensor_spec,
    optimizer=optimizer,
    actor_net=actor_net,
    value_net=value_net,
    importance_ratio_clipping=0.2,
    discount_factor=0.99,
    entropy_regularization=0.0,
    normalize_observations=False,
    normalize_rewards=False,
    use_gae=True,
    num_epochs=10,
    train_step_counter=global_step)

where the actor and value networks are defined as

import tensorflow as tf
from tf_agents.agents.ppo import ppo_actor_network
from tf_agents.networks import value_network

actor_net_builder = ppo_actor_network.PPOActorNetwork()
actor_net = actor_net_builder.create_sequential_actor_net((256, 256), action_tensor_spec)
value_net = value_network.ValueNetwork(
    observation_tensor_spec,
    fc_layer_params=(256, 256),
    kernel_initializer=tf.keras.initializers.Orthogonal())

While the actions in the evaluation environment's initial run stay within the (-0.2, 0.2) bounds, the actions in the collect environment, which runs once the evaluation finishes, go well beyond the bounds, reaching 1.0. Is this supposed to happen? How can one control it?
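
To make the question concrete, below is the kind of check that shows the behaviour (a sketch; collect_env stands in for my TF collect environment and the step count is arbitrary): it samples actions from the collect policy and flags any components outside the spec bounds.

import tensorflow as tf

# Sketch: step the collect environment with the collect policy and flag any
# action components that fall outside the BoundedTensorSpec limits.
# `collect_env` is a placeholder for the TF collect environment.
time_step = collect_env.reset()
policy_state = agent.collect_policy.get_initial_state(collect_env.batch_size)
for _ in range(10):
    action_step = agent.collect_policy.action(time_step, policy_state)
    action = action_step.action
    outside = tf.logical_or(
        tf.reduce_any(action < action_tensor_spec.minimum),
        tf.reduce_any(action > action_tensor_spec.maximum))
    tf.print('action:', action, 'outside bounds:', outside)
    time_step = collect_env.step(action)
    policy_state = action_step.state

If unclipped Gaussian samples are expected from this actor network, is clipping the executed actions on the environment side the intended way to handle it? For example (again only a sketch, and collect_py_env is a placeholder for the underlying Python environment):

from tf_agents.environments import tf_py_environment, wrappers

# Clip executed actions to the action spec before they reach the environment.
# `collect_py_env` is a placeholder for the Python env used for collection.
collect_py_env = wrappers.ActionClipWrapper(collect_py_env)
collect_env = tf_py_environment.TFPyEnvironment(collect_py_env)

Or should this be handled at the policy level instead?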

Thanks!

b-fg commented 1 year ago

Pinging @m-kurz in case you know something about it :)

b-fg commented 11 months ago

Any update on this? Thanks.