I have two environments, one to collect and one to evaluate trajectories. The `action_tensor_spec` is the following:

The agent is a `PPOClipAgent` object, defined as

where the actor and value networks are defined as
While the actions in the evaluation environment's initial run stay within the (-0.2, 0.2) bounds, the actions in the collect environment, which runs once the evaluation environment finishes, go well beyond those bounds, reaching 1.0. Is this supposed to happen, and how can I control it?
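For context, here is how I'm currently working around it: clipping each sampled collect action back into the spec bounds before stepping the environment. This is a plain-NumPy sketch of the idea; the (-0.2, 0.2) bounds come from my `action_tensor_spec`, and I'm aware TF-Agents also offers `tf_agents.environments.wrappers.ActionClipWrapper` to do this at the environment level. Is clipping like this the right approach, or should the policy's action distribution itself be bounded?

```python
import numpy as np

# Bounds taken from my action_tensor_spec.
ACTION_MIN, ACTION_MAX = -0.2, 0.2

def clip_to_spec(action):
    """Clip a sampled action back into the spec bounds before stepping the env."""
    return np.clip(action, ACTION_MIN, ACTION_MAX)

# A collect-policy sample that went out of bounds (hypothetical values):
sampled = np.array([1.0, -0.7, 0.1])
print(clip_to_spec(sampled))  # out-of-range components get clamped to +/-0.2
```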
Thanks!