hamzamerzic opened this issue 7 years ago
UPDATE: Actually, not all of the environments do the clipping automatically. I just came across this example: https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py#L238, where the environment crashes when the algorithms above produce out-of-range actions. @joschu Could you confirm if this is a bug or the intended behavior? I can put up a PR to fix it if it is.
This was very helpful for finding my issue. Thanks @hamzamerzic :)
@rayman813 I am glad I helped! :+1: Would you mind sharing more details of what the issue was and how you fixed it, in case others stumble upon the same problem?
What is the point of the action space having bounds if they are not respected? I have a custom environment where the action space is a Box(0.0, 1.0, n) space and I want to use the PPO1/PPO2 algorithms. This issue causes my env to crash, as actions outside that range don't make sense. What is the actual output range of the actions in the PPO algorithms, and how should I rescale? E.g. is clipping better, or transforming with something like a sigmoid?
@aeoost the simplest way around this is probably to create a wrapper that clips actions to the valid range.
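For anyone looking for a concrete starting point, here is a minimal sketch of such a clipping wrapper (the class name ClipActionsWrapper is just illustrative; note that older gym versions name the override hook _action instead of action):

import gym
import numpy as np

class ClipActionsWrapper(gym.ActionWrapper):
    """Clips every action into the Box bounds before the wrapped env sees it."""
    def action(self, action):
        # Older gym releases call this hook _action instead.
        return np.clip(action, self.action_space.low, self.action_space.high)

    def reverse_action(self, action):
        # Clipping is not invertible, so just pass the action through.
        return action

# usage: env = ClipActionsWrapper(gym.make("LunarLanderContinuous-v2"))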
I'm with @aeoost. I've run into this problem as well, with the OpenAI baselines and with other DRL implementations. In my opinion, it would make sense for the model to output actions in the correct range. The lower and upper bounds of the action_space should be taken into account. They are part of the specification of the environments, and the specification should be respected; otherwise, there is no point in having those bounds to begin with.
Hi. My opinion is:
@olegklimov In the link I posted in my comment above, you can see that the environment actually crashes when the action is outside the limits; it does not penalize such actions. This means that baselines are not fully compatible with gym, unless I missed something crucial here, like some wrapper that handles this automatically.
@olegklimov I still feel this issue should not be closed. There is clearly an inconsistency between baselines and gym. Should I open an issue with gym instead?
@hamzamerzic @olegklimov Fully agree that this issue should not be closed. I think the fix belongs in baselines. It makes perfect sense for environments to have limits on the values continuous actions can take, and it should not be difficult to make a model that properly takes those into account. People shouldn't have to rely on hacks to get around this issue. Please fix the problem.
I agree that agents should respect action ranges, and that baselines does a bad job of this right now. We not only don't clip actions, we also don't stretch the policy outputs to the correct range. For example, if an action space has a large range like [-100, 100], our agents would still start by outputting values with stddev 1.
The question isn't really whether this is a problem; it's how best to fix it. I proposed doing the fix in a wrapper (which could be placed in baselines/common). This seems like a hack, and it kind of is. However, it would be easy to drop the wrapper into every algorithm in baselines and get an immediate, universal fix. All the other fixes I can think of would be algorithm-specific.
@maximecb what do you have in mind when you say the model should output values in the correct range? If you parameterize a policy as a Gaussian, it must be able to output values in [-inf, inf]. If you clip the outputs coming out of the policy and use these clipped values in a policy gradient update, the resulting gradient will be biased (since the log probs won't reflect the clipped pdf). No matter how you slice it, the policy must believe that it can take any action value it wants to, otherwise the policy gradient is wrong.
There are plenty of implementation-specific points to insert clipping (e.g. the argument to step() in PPO1). However, using a wrapper will not require changing every single implementation, while pretty much any other approach will.
As a side note, the Gaussian distribution is probably not ideal for these kinds of problems anyway. See, for example, this paper on using the Beta distribution in RL. The Beta distribution is bounded between 0 and 1, making it more appropriate for problems where the action space is constrained to an interval.
That sounds like the right approach to me. If the beta distribution is bounded between 0 and 1, it will be easy to translate and scale that range appropriately. With a Gaussian distribution, which has infinite range, you can only have hacky fixes, and it will be hard to learn some action ranges.
Right, let's implement a wrapper because it is a correct thing to do.
LunarLander specifically I'd change to -Inf..Inf actions, it already penalizes "fuel usage".
I don't think the action range supplied by the env really has any meaning the agent can act on; -Inf..Inf is a good example. We don't have a mechanism to "recommend" a range to the agent (0..1 vs -1..+1, for example), and we probably don't want one, as this should be discoverable by the agent automatically.
Could someone please point me to an implementation where this issue is handled.
IMO the paper pointed to by @unixpickle on the beta distribution for continuous RL is the best starting point. That author may be willing to share his implementation (if it isn't already on github).
At the moment, even a hacky implementation with clipping would do. I am not sure how to convert the infinite range to a finite range with clipping. Is clipping the action as below the only change I need to make, assuming the algorithm eventually figures out how to output actions in the right range?
action = np.clip(action, self.action_space.low, self.action_space.high)
@olegklimov It's still important to respect the action ranges supplied by the envs, even though they may be arbitrary as in the LunarLander case. The fact that LunarLander's internal reward signal includes a penalty for actions makes its action range particularly arbitrary/unnecessary; however, simply changing it to (-inf, inf) is problematic because 1) it assumes we have domain knowledge about the environment and 2) you can imagine it could also change the optimal policy. An action space range is simply a constraint of the problem, so the solution can't simply be to change it. Besides, real applications have continuous action spaces that are bounded, so we need algorithms that can deal with them and benchmark environments that can respect them.
A wrapper is a good place to start, though it should be recognized that it's a hack, since as @unixpickle pointed out it will bias the policy gradient. Then again, if you had a black-box environment that clipped your actions, you'd never know...
We don't have a mechanism to "recommend" range to the agent (0..1 vs -1..+1 for example).
This is precisely why implementing a Beta policy is the only non-hack solution that makes sense to me. You can do this "recommending" in a principled way with any distribution whose support is an interval (i.e. no infinity). Simply make action_space.low correspond to the lower bound of the support and action_space.high correspond to the upper bound. You could also use a Gamma for an environment with an action range of (0, inf), for example.
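To make the mapping concrete, here is a tiny sketch of the affine rescaling this implies (beta_sample_to_action is just an illustrative helper, assuming a Box action space):

import numpy as np

def beta_sample_to_action(beta_sample, action_space):
    """Map a sample in [0, 1] (the support of a Beta distribution)
    elementwise onto [action_space.low, action_space.high]."""
    low, high = action_space.low, action_space.high
    return low + (high - low) * beta_sample

# a sample of 0.0 lands on action_space.low, 1.0 on action_space.high,
# and 0.5 on the midpoint of the bounds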
Is there currently any effort of implementing a beta policy as a baseline distribution? @brendenpetersen
@pmwenzel Not that I know of. I've started working on an implementation for an MlpBetaPolicy class (analogous to MlpPolicy, which uses a diagonal Gaussian policy) and associated distributions. I can share the fork with you.
@brendenpetersen Sure, that would be great.
@brendenpetersen Could you please share your implementation?
@pmwenzel @zishanahmed08 I implemented a beta policy; feel free to try it out from my fork. Unfortunately, the baselines repository as a whole is not very modularized; for example, TRPO, PPO1, PPO2, and ACKTR all have their own policy implementations (with the lone exception of TRPO sharing PPO1's MLP policy), often with identical portions of code. I'm extremely uncomfortable with that; however, I also doubt they'd fold in a bunch of structural changes to their code if I did the modularization myself. So, for now I implemented the beta MLP policy as part of PPO2. It should be straightforward if not trivial to adapt to some of the other policy gradient algorithms.
Lastly, I included one hack, because sampling actions from the beta policy sometimes returned values of exactly 0.0 or 1.0 (the very ends of the support of the beta distribution). I'm not sure why; perhaps it's a floating-point precision issue. At any rate, this resulted in numerical issues for downstream calculations (like the likelihood ratio). The hack simply clips samples to [1e-5, 1 - 1e-5].
I'd like to reference Po-Wei Chou's thesis on the beta policy, on which I based the policy and which had some useful ideas like using a softplus activation for the beta shape parameters.
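For readers who just want the gist, the parameterization described above boils down to something like the sketch below (written in the TF 1.x style baselines used at the time; beta_policy_head and latent are illustrative names rather than code from the fork, and the +1 shift is one way to keep the shape parameters at or above 1 so the density is not U-shaped):

import tensorflow as tf

def beta_policy_head(latent, act_dim, eps=1e-5):
    # Beta shape parameters via softplus; the +1 keeps alpha, beta >= 1
    alpha = 1.0 + tf.nn.softplus(tf.layers.dense(latent, act_dim, name="alpha"))
    beta = 1.0 + tf.nn.softplus(tf.layers.dense(latent, act_dim, name="beta"))
    dist = tf.distributions.Beta(concentration1=alpha, concentration0=beta)
    # The hack mentioned above: keep samples away from the support endpoints
    # so downstream log-probs and likelihood ratios stay finite.
    sample = tf.clip_by_value(dist.sample(), eps, 1.0 - eps)
    return dist, sample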
Potentially relevant: https://arxiv.org/abs/1802.07564
@brendenpetersen did you run any benchmarks on beta vs non-beta (ideally Hopper & Walker2d)? Unity ML-Agents has a beta implementation in progress, and when I tested it on my Unity MuJoCo implementations it performed less well than vanilla PPO: https://github.com/Unity-Technologies/ml-agents/pull/581
@Sohojoe No, I only tested on LunarLander. I don't have a MuJoCo license.
Beta-PPO didn't really perform better than Gaussian-PPO even on LunarLander; however, it's not really the fairest comparison because hyperparameters were originally tuned using the Gaussian policy. For all we know Beta could severely outperform Gaussian if its hyperparameters were independently tuned.
I stumbled across the exact same problem, training LunarLanderContinuous-v2 with ppo1 baseline. @zishanahmed08 As you suggested, I added a single line
action = np.clip(action, self.action_space.low, self.action_space.high)
above the assertion on the action in the LunarLander environment: lunarlandercode. This solved the problem for me, though the solution is most likely suboptimal.
Still having issues with LunarLanderContinuous-v2
... in the end, which solution is better: constraining the agent (e.g. PPO) to output actions within the legal range, or simply correcting for out-of-range actions within the environment's code via np.clip?
Use clip. (either modify LunarLander or your code)
It is tested to work.
The problem is not 'correctness'; the problem is the lack of gradient when the action is clipped. But that is not an issue in this case: because fuel usage is punished in LunarLander, it's not beneficial to sit at the limit for a long time.
Hi all, this is the workaround/hack I've come up with in order to respect the environment's (possibly asymmetric) action bounds. In baselines/ddpg/training.py I've added a scaling of the actions before they are executed.
# ...
for t_rollout in range(nb_rollout_steps):
    # Predict next action.
    action, q = agent.pi(obs, apply_noise=True, compute_Q=True)
    # action is an array with entries between -1 and 1
    # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
    target = scale_range(action, -1, 1, env.action_space.low, env.action_space.high)
    # target is an array with scaled actions

    # Execute next action.
    if rank == 0 and render:
        env.render()
    assert target.shape == env.action_space.shape
    new_obs, r, done, info = env.step(target)
    t += 1
    if rank == 0 and render:
        env.render()
# ...

def scale_range(x, x_min, x_max, y_min, y_max):
    """Scales the entries in x, which have a range between x_min and x_max,
    to the range defined between y_min and y_max."""
    # y = a*x + b
    # a = deltaY/deltaX
    # b = y_min - a*x_min (or b = y_max - a*x_max)
    y = (y_max - y_min) / (x_max - x_min) * x + (y_min * x_max - y_max * x_min) / (x_max - x_min)
    return y
This is just a simple linear scaling from the [-1, 1] range of the DDPG algorithm to the action range provided by the environment (e.g. [-3, 22.5]). It works for multiple action dimensions as well. I don't know if that's the way to go, but it worked for me (at least I could cope somehow with the asymmetric action bounds of my environment). If you have any concerns, feel free to comment on my solution.
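To sanity-check the mapping with the example bounds above (assuming scale_range from the snippet is in scope):

import numpy as np

low, high = np.array([-3.0]), np.array([22.5])
actions = np.array([-1.0, 0.0, 1.0])
print(scale_range(actions, -1, 1, low, high))
# maps -1 -> -3.0, 0 -> 9.75, and 1 -> 22.5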
Hi @brendenpetersen, I can't seem to find your fork with the beta distribution implementation. I'm having performance issues with clipping the bounds and was hoping to try your approach.
Update: I found Tensorforce has implemented a beta distribution, https://github.com/reinforceio/tensorforce/blob/master/tensorforce/core/distributions/beta.py
I would expect the env to be robust and not crash in the case of an out-of-bounds (oob) action. E.g. pressing the up button beyond its limit has the same effect as pressing it at the limit; it's just not as efficient...
It's up to the author of the env to decide whether to punish oob actions, or handle/clip them.
For the author of the agent, using the action limits would accelerate training, but it's not a requirement. The agent will [eventually] learn that -5:5 has the same effect as -1:1, and will ignore [-5..-2]:[2..5] as having 'no benefit'.
However, the agent will struggle to learn this if the env crashes.
@ghost Did you solve your problem? I have the same issue: my action boundaries are between -500 and 500, but the actions from the network only vary between roughly -3 and 4. Has anybody used a MuJoCo environment with a large range of torque values? I clipped my actions and it is not enough. What is the output action range for PPO? Is there any?
Hey. I have a similar problem. I'm building a custom environment to solve a research problem.
This is my observation space and every action changes these values, increasing or decreasing them.
self.observation_space = Box( low=np.array([0,100,200,300,400,500,600,700...]), high=np.array([1000,20000,3000,4000,5000,6000,7000...]), dtype=np.int64)
I'm using the observation space as a way to track the status of my agent in the environment. I've imported my environment into the RLlib library to use some of its algorithms. The observations shouldn't be able to go out of bounds, but that was happening and it made my algorithm fail. So I've implemented a check in the step() function that skips the action if it would take the observation out of bounds and returns the same reward as in the previous step. Is there a cleaner way to do this, in your opinion? Maybe a good idea would be to reduce the action_space when a value is at the edge of the domain, but how can I do that?
hi @fbbfnc, what I have done to solve the problem is add an action_modifier() function to my env.py file. It takes the action from the network and adjusts it by multiplying with numbers suitable for my environment; after the multiplication, I clip the values according to my boundaries, and that worked for me. My agent is learning with TRPO.
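For illustration only, such a modifier might look like the sketch below (action_modifier and ACTION_SCALE are hypothetical names and values, not the actual code from that env.py):

import numpy as np

ACTION_SCALE = 500.0  # hypothetical factor matching the env's torque range

def action_modifier(action, action_space):
    """Scale the raw network output, then clip it to the env's bounds."""
    scaled = action * ACTION_SCALE
    return np.clip(scaled, action_space.low, action_space.high)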
Does action range clipping on the environment side really work well? My custom environment's action range is far from [-1, 1], and when I apply clipping in the environment, the agent keeps hitting the clipped value near the [-1, 1] boundary. Because proba_distribution_from_latent() has the initial parameters init_scale=1.0, init_bias=0.0, the initial action range is roughly [-1, 1]. As training progresses, this range should adaptively move toward the range that covers the environment's, but it doesn't seem to act like that: the agent cannot explore due to clipping, so no learning occurs (the stddev cannot play its role because of clipping). No exploring, no information gain, no learning. In my opinion, the init_bias of the linear function in proba_distribution_from_latent(), which infers the mean, should lie within the environment's action range. Am I missing some point?
Hi @brendenpetersen
I really appreciate your work, but where can I find your implementation of the MLP beta policy? I checked your "stable-baselines" fork but did not find it in the PPO2 or common folders.
Hi @brendenpetersen. Have you verified that sampling from the beta distribution can actually return values of 0 or 1? I am using tfp.distributions.Beta to get an instance of the distribution. Have you done any parameter tuning for the beta policy?
This is more of a question than an issue. I noticed that in the implementation of the above-mentioned algorithms, action limits are not taken into account. Environments handle this clipping internally, so no errors appear, but this leads to situations where the algorithm's training batch contains inputs that were not necessarily the ones applied within the environment.
For example, let's say the upper limit for an input is 1 and the applied input (given by the algorithm) is 5. What the environment experiences is an input of 1, due to its internal clipping, but the algorithm's training batch will contain data with the action equal to 5.
Intuitively it makes sense that the algorithm will learn how to deal with this, but I am wondering if using the information about exactly which action was applied would be beneficial. Additionally, we could consider applying the action clipping even before adding noise (since noise doesn't really do anything if the mean is already out of limits). For example, DDPG handles this nicely with tanh outputs before applying the noise, and with clipping applied afterwards.
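For reference, the DDPG pattern mentioned at the end amounts to roughly the following sketch (illustrative names; it assumes symmetric bounds of the form [-max_action, max_action]):

import numpy as np

def ddpg_style_action(actor_output, noise, max_action):
    """tanh-squash the actor output, add exploration noise,
    then clip back into [-1, 1] and scale to the env's bounds."""
    a = np.tanh(actor_output)     # bounded to (-1, 1) before the noise
    a = a + noise                 # exploration noise can push it out of range again
    a = np.clip(a, -1.0, 1.0)     # clipping applied after the noise, as described above
    return max_action * a         # scale to [-max_action, max_action]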