parhartanvir opened this issue 6 years ago
Hi @parhartanvir! To answer your question directly: I am pretty sure the actions (or policy outputs) are not clipped anywhere in ppo1 (putting @joschu in the loop here as the source of ground truth). The small actions are probably related to the fact that the output layer of MlpPolicy is initialized with a small variance, 0.01, here: https://github.com/openai/baselines/blob/1f99a562e3df9eaad96b59d44677006ef54ca1c2/baselines/ppo1/mlp_policy.py#L37 So even though the policy can learn to output large actions, if there is not enough reward signal from small actions, the policy will never learn to do so.
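To illustrate the point about initialization (a rough numpy sketch, not the baselines code; the layer sizes here are made up), a small-scale init of the final layer keeps the initial action means close to zero:

```python
import numpy as np

# Rough illustration of why a freshly initialized policy outputs small actions:
# the action head's weights are drawn with a small scale, similar in spirit to
# the normc_initializer(0.01) used in mlp_policy.py.
rng = np.random.default_rng(0)
hidden = np.tanh(rng.standard_normal((1, 64)))  # a typical hidden-layer activation
w_out = rng.standard_normal((64, 6)) * 0.01     # small-scale init of the output layer
b_out = np.zeros(6)

action_mean = hidden @ w_out + b_out
print(np.abs(action_mean).max())  # on the order of 0.1, nowhere near a |5| control range
```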
Apart from that, I'd like to point out that ppo2 should work with legacy environments too, and I'd be much happier debugging issues with ppo2, given that we are deprecating ppo1 :)
So, what is the connection between the lower and upper bounds of the action space and the baselines code? I put "okay-ish" values there, and I'm worried that something is being done with them internally. Also, is there a chatroom (IRC, Gitter, Slack, anything) where we have a direct line to contact you guys?
@Atcold Not sure I understand the question... you mean, given the lower and upper bounds of the action space, how to ensure that the policy can explore the whole space? The simplest way would be, I guess, to add an action wrapper that scales actions by constant factors (one constant per dimension). A more principled approach would be to modify the action distribution according to the dimensions of the action space. We did a bit of work on using beta distributions instead of a diagonal Gaussian for continuous action spaces with finite limits. In MuJoCo, that did not make any difference, but it is possible that in some cases that would be a decent solution.
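As a concrete illustration of the wrapper idea, here is a minimal sketch assuming a gym-style environment (the class name and the [-1, 1] convention are mine, not something baselines ships):

```python
import gym
import numpy as np

class RescaleAction(gym.ActionWrapper):
    """Advertise a [-1, 1] action space to the agent and rescale its actions
    to the real [low, high] bounds of the underlying environment."""

    def __init__(self, env):
        super().__init__(env)
        self.low = env.action_space.low
        self.high = env.action_space.high
        self.action_space = gym.spaces.Box(
            low=-1.0, high=1.0, shape=env.action_space.shape, dtype=np.float32)

    def action(self, action):
        # Map each dimension from [-1, 1] to [low, high], then clip for safety.
        rescaled = self.low + (np.asarray(action) + 1.0) * 0.5 * (self.high - self.low)
        return np.clip(rescaled, self.low, self.high)
```

The environment then sees actions on its native scale while the policy keeps operating in a unit range.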
About the chatroom - I don't think such a thing exists; I'll bring up the possibility of setting something like that up with the team.
@pzhokhov, my question was simply: "Are the lower and upper action space bounds used anywhere in the code?". And yes, an IRC/Gitter channel would be AMAZING!
I am trying to learn to control a UR5 arm using PPO1 (because of my legacy environment issue) rather than PPO2. I have ported the code from the run_humanoid.py example, with similar hyper-parameters. The problem is that even though the control range of my model is from -5 to 5, the MlpPolicy that I am training generates very small values (-0.2 to 0.2). Does the trajectory generator or the policy clip these values somewhere internally? (A rough way to check this is sketched below.)
Thank you, Tanvir
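For reference on the clipping question above: one quick way to see what actually reaches the environment is to wrap it and log the actions passed to env.step. This is a minimal sketch assuming a gym-style UR5 environment; the wrapper name is hypothetical, not a baselines utility:

```python
import gym
import numpy as np

class ActionLogger(gym.Wrapper):
    """Record the raw actions that reach env.step, to check whether anything
    clips or rescales them on the way from the policy to the environment."""

    def __init__(self, env):
        super().__init__(env)
        self.actions = []

    def step(self, action):
        self.actions.append(np.asarray(action).copy())
        return self.env.step(action)

# After training for a while:
#   np.abs(np.concatenate(logger.actions)).max()
# shows the largest action the policy ever sent to the environment.
```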