openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

Taking action limits into account in PPO/TRPO/ACKTR. #121

Open hamzamerzic opened 7 years ago

hamzamerzic commented 7 years ago

This is more of a question than an issue. I noticed that in the implementations of the above-mentioned algorithms, action limits are not taken into account. Environments handle this clipping internally, so no errors appear, but this leads to situations where the algorithm's training batch contains actions that were not necessarily the ones applied in the environment.

For example, let's say the upper limit for an input is 1 and the applied input (given by the algorithm) is 5. What the environment will experience is an input of 1, due to its internal clipping, but the algorithm's training batch will contain an action equal to 5.

Intuitively it makes sense that the algorithm will learn how to deal with this, but I am wondering whether using the information about exactly which action was applied would be beneficial. Additionally, we could think of applying the action clipping even before adding noise (since noise doesn't really do anything if the mean is already out of limits). For example, DDPG handles this nicely with tanh outputs before applying the noise, and with clipping applied afterwards.
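
For reference, a minimal numpy sketch of that DDPG-style recipe (the function name and the [-1, 1] bounds here are illustrative, not code from baselines):

import numpy as np

def ddpg_style_action(mu, noise, low=-1.0, high=1.0):
    # tanh keeps the deterministic part inside (-1, 1); the added noise can still
    # push the sample out of range, so clip as a final step
    return np.clip(np.tanh(mu) + noise, low, high)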

hamzamerzic commented 7 years ago

UPDATE: Actually, not all of the environments do the clipping automatically. I just came across this example: https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py#L238, so in this case the algorithms above crash. @joschu Could you confirm if this is a bug, or the intended behavior? I can put up a PR to fix it if it is.

rayman813 commented 7 years ago

This was very helpful for finding my issue. Thanks @hamzamerzic :)

hamzamerzic commented 7 years ago

@rayman813 I am glad I helped! :+1: Would you mind sharing more details of what the issue was and how you fixed it, in case others stumble upon the same problem?

ghost commented 7 years ago

What is the point of the action space having bounds if they are not respected? I have a custom environment where the action space is a Box(0.0, 1.0, n) space and I want to use the PPO1/2 algorithms. This issue causes my env to crash, as actions outside that range don't make sense. What is the actual output range of the actions in the PPO algorithms, and how should I rescale - e.g. is clipping or transforming with, say, a sigmoid better?

unixpickle commented 6 years ago

@aeoost the simplest way around this is probably to create a wrapper that clips actions to the valid range.
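
Something along these lines (a rough sketch, not something that currently ships with baselines):

import gym
import numpy as np

class ClipActionWrapper(gym.ActionWrapper):
    """Clip agent actions to the env's Box bounds before they reach step()."""
    def action(self, action):
        return np.clip(action, self.action_space.low, self.action_space.high)

# usage: env = ClipActionWrapper(gym.make("LunarLanderContinuous-v2"))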

maximecb commented 6 years ago

I'm with @aeoost. I've run into this problem as well, both with the OpenAI baselines and with other DRL implementations. In my opinion, it would make sense for the model to output actions in the correct range. The lower and upper bounds of the action_space should be taken into account; they are part of the specification of the environments. The specification should be respected, otherwise there is no point in having those bounds to begin with.

olegklimov commented 6 years ago

Hi. My opinion is:

  1. Allow the policy to output values higher than specified; clip them in the env or a wrapper.
  2. Add a little punishment (negative reward) for any high-powered actions, like LunarLander does. You have a choice to punish clipped or unclipped actions; both work. The learned policy will be more economical in using actions and look better.

hamzamerzic commented 6 years ago

@olegklimov In the link I posted in the comment, you can see that the environment actually crashes when the action is outside of the limits; it does not penalize the actions. This means that baselines are not fully compatible with gym, unless I missed something crucial here, like some wrapper that handles this automatically.

hamzamerzic commented 6 years ago

@olegklimov I still feel this issue should not be closed. There is clearly an inconsistency between baselines and gym. Should I open an issue with the gym instead?

maximecb commented 6 years ago

@hamzamerzic @olegklimov Fully agree that this issue should not be closed. I think the fix belongs in baselines. It makes perfect sense for environments to have limits on the values continuous actions can take, and it should not be difficult to make a model that properly takes those into account. People shouldn't have to rely on hacks to get around this issue. Please fix the problem.

unixpickle commented 6 years ago

I agree that agents should respect action ranges, and that baselines does a bad job of this right now. We not only don't clip actions, we also don't stretch the policy outputs to the correct range. For example, if an action space has a large range like [-100, 100], our agents would still start by outputting values with stddev 1.

The question isn't really whether this is a problem--it's how to best fix the problem. I proposed doing the fix in a wrapper (which could be placed in baselines/common). This seems like a hack, and it kind of is. However, it would be easy to drop the wrapper into every algorithm in baselines and have an immediate universal fix. All the other fixes I can think of would be algorithm-specific.

@maximecb what do you have in mind when you say the model should output values in the correct range? If you parameterize a policy as a Gaussian, it must be able to output values in [-inf, inf]. If you clip the outputs coming out of the policy and use these clipped values in a policy gradient update, the resulting gradient will be biased (since the log probs won't reflect the clipped pdf). No matter how you slice it, the policy must believe that it can take any action value it wants to, otherwise the policy gradient is wrong.
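
As a toy numerical illustration of that mismatch (scipy used purely for exposition; the numbers are made up): with a Gaussian policy of mean 3 and std 1 and an upper action bound of 1, the density a naive update would assign to the clipped action differs from the actual probability mass that lands on the boundary:

from scipy.stats import norm

mu, sigma, high = 3.0, 1.0, 1.0
logp_naive = norm.logpdf(high, mu, sigma)  # what a naive update would use for the clipped action
logp_atom = norm.logsf(high, mu, sigma)    # log P(raw sample >= bound): the true mass at the boundary
print(logp_naive, logp_atom)               # these disagree, so the gradient estimate is biased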

There are plenty of implementation-specific points to insert clipping (e.g. the argument to step() in PPO1). However, using a wrapper will not require changing every single implementation, while pretty much any other approach will.

As a side note, the Gaussian distribution is probably not ideal for these kinds of problems anyway. See, for example, this paper on using the Beta distribution in RL. The Beta distribution is bounded between 0 and 1, making it more appropriate for problems where the action space is constrained to an interval.

maximecb commented 6 years ago

this paper on using the Beta distribution in RL. The Beta distribution is bounded between 0 and 1, making it more appropriate for problems where the action space is constrained to an interval.

That sounds like the right approach to me. If the beta distribution is bounded between 0 and 1, it will be easy to translate and scale that range appropriately. With a Gaussian distribution, which has infinite range, you can only have hacky fixes, and it will be hard to learn some action ranges.

olegklimov commented 6 years ago

Right, let's implement a wrapper because it is a correct thing to do.

LunarLander specifically I'd change to -Inf..Inf actions; it already penalizes "fuel usage".

I don't think the action range supplied by the env really has any meaning that is actionable by the agent; -Inf..Inf is a good example. We don't have a mechanism to "recommend" range to the agent (0..1 vs -1..+1 for example). And we probably don't want one, as this should be discoverable by the agent automatically.

zishanahmed08 commented 6 years ago

Could someone please point me to an implementation where this issue is handled?

maximecb commented 6 years ago

IMO the paper pointed to by @unixpickle on the beta distribution for continuous RL is the best starting point. That author may be willing to share his implementation (if it isn't already on github).

zishanahmed08 commented 6 years ago

At the moment, even a hacky implementation with clipping will do. I am not sure how to convert the infinite range to a finite range with clipping.

Is this the only change I need to make - clipping the action as below - and assuming that the algorithm eventually figures out how to output actions in the right range?

action = np.clip(action, self.action_space.low, self.action_space.high)

brendenpetersen commented 6 years ago

@olegklimov It's still important to respect the action ranges supplied by the envs, even though they may be arbitrary as in the LunarLander case. The fact that LunarLander's internal reward signal includes a penalty for actions makes its action range particularly arbitrary/unnecessary; however, simply changing it to (-inf, inf) is problematic because 1) it assumes we have domain knowledge about the environment and 2) you can imagine it could also change the optimal policy. An action space range is simply a constraint of the problem, so the solution can't simply be to change it. Besides, real applications have continuous action spaces that are bounded, so we need algorithms that can deal with them and benchmark environments that can respect them.

A wrapper is a good place to start, though it should be recognized that it's a hack, since, as @unixpickle pointed out, it will bias the policy gradient. Though I suppose if you had a black-box environment that clipped your actions, you'd never know...

We don't have a mechanism to "recommend" range to the agent (0..1 vs -1..+1 for example).

This is precisely why implementing a Beta policy is the only non-hack solution that makes sense to me. You can do this "recommending" in a principled way with any distribution whose support is an interval (i.e. no infinity). Simply make action_space.low correspond to the lower bound of the support and action_space.high correspond to the upper bound. You could also use a Gamma for an environment with an action range of (0, inf), for example.
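
For concreteness, a minimal numpy sketch of that mapping (illustrative only; in a real policy, alpha and beta would come from the network):

import numpy as np

def beta_action(alpha, beta, low, high):
    u = np.random.beta(alpha, beta)  # sample in (0, 1)
    return low + (high - low) * u    # affinely mapped onto [action_space.low, action_space.high]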

pmwenzel commented 6 years ago

Is there currently any effort to implement a beta policy as a baseline distribution? @brendenpetersen

brendenpetersen commented 6 years ago

@pmwenzel Not that I know of. I've started working on an implementation for an MlpBetaPolicy class (analogous to MlpPolicy, which uses a diagonal Gaussian policy) and associated distributions. I can share the fork with you.

pmwenzel commented 6 years ago

@brendenpetersen Sure, that would be great.

zishanahmed08 commented 6 years ago

@brendenpetersen Could you please share your implementation?

brendenpetersen commented 6 years ago

@pmwenzel @zishanahmed08 I implemented a beta policy; feel free to try it out from my fork. Unfortunately, the baselines repository as a whole is not very modularized; for example, TRPO, PPO1, PPO2, and ACKTR all have their own policy implementations (with the lone exception of TRPO sharing PPO1's MLP policy), often with identical portions of code. I'm extremely uncomfortable with that; however, I also doubt they'd fold in a bunch of structural changes to their code if I did the modularization myself. So, for now I implemented the beta MLP policy as part of PPO2. It should be straightforward if not trivial to adapt to some of the other policy gradient algorithms.

Lastly, I included one hack, because sampling actions from the beta policy sometimes returned values of 0. or 1. (the very ends of the support of the beta distribution). I'm not sure why--perhaps it's a floating point precision issue. At any rate, this resulted in numerical issues for downstream calculations (like the likelihood ratio). The hack simply clips samples to [1e-5, 1 - 1e-5].

I'd like to reference Po-Wei Chou's thesis on the beta policy, on which I based the policy and which had some useful ideas like using a softplus activation for the beta shape parameters.
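
For reference, a small numpy sketch of that shape-parameter trick as I understand it (the +1 offset, which keeps alpha and beta above 1 so the Beta stays unimodal, is my reading of the thesis; treat the exact form as an assumption):

import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def beta_shape_params(raw_alpha, raw_beta):
    # softplus keeps the shape parameters positive; the +1 keeps the distribution unimodal
    return 1.0 + softplus(raw_alpha), 1.0 + softplus(raw_beta)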

unixpickle commented 6 years ago

Potentially relevant: https://arxiv.org/abs/1802.07564

Sohojoe commented 6 years ago

@brendenpetersen did you run any benchmarks on beta vs non-beta (ideally Hopper & Walker2d)? Unity-ML has a beta implementation in progress, and when I tested it on my Unity MuJoCo implementations it performed less well than vanilla PPO - https://github.com/Unity-Technologies/ml-agents/pull/581

brendenpetersen commented 6 years ago

@Sohojoe No, I only tested on LunarLander. I don't have a MuJoCo license.

Beta-PPO didn't really perform better than Gaussian-PPO even on LunarLander; however, it's not really the fairest comparison, because the hyperparameters were originally tuned for the Gaussian policy. For all we know, Beta could severely outperform Gaussian if its hyperparameters were tuned independently.

brucewayne1248 commented 6 years ago

I stumbled across the exact same problem, training LunarLanderContinuous-v2 with the ppo1 baseline. @zishanahmed08 As you suggested, I added a single line action = np.clip(action, self.action_space.low, self.action_space.high) above the assertion of the action in the lunar lander environment: lunarlandercode. This solved the problem for me, though it is most likely suboptimal.

david1309 commented 6 years ago

Still having issues with LunarLanderContinuous-v2... In the end, which solution is better: constraining the agent (e.g. PPO) to output actions within the legal range, or simply correcting for out-of-range actions via np.clip within the environment's code?

olegklimov commented 6 years ago

Use clip. (either modify LunarLander or your code)

It is tested to work.

The problem is not 'correctness'; the problem is the lack of gradient when the action is clipped. But that is not a problem in this case, because fuel usage is punished in LunarLander, so it's not beneficial to sit at the limit for a long time.

joellutz commented 6 years ago

Hi all, this is the workaround/hack which I've come up with in order to respect the environment's (possibly asymmetric) action bounds. In baselines/ddpg/training.py I've added a scaling of the actions before they are executed.

# ...
for t_rollout in range(nb_rollout_steps):
    # Predict next action.
    action, q = agent.pi(obs, apply_noise=True, compute_Q=True)
    # action is an array with entries between -1 and 1

    # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
    target = scale_range(action, -1, 1, env.action_space.low, env.action_space.high)
    # target is an array with scaled actions

    # Execute next action.
    if rank == 0 and render:
        env.render()
    assert target.shape == env.action_space.shape
    new_obs, r, done, info = env.step(target)
    t += 1
    if rank == 0 and render:
        env.render()
    # ...

def scale_range(x, x_min, x_max, y_min, y_max):
    """ Scales the entries in x which have a range between x_min and x_max
    to the range defined between y_min and y_max. """
    # y = a*x + b
    # a = deltaY/deltaX
    # b = y_min - a*x_min (or b = y_max - a*x_max)
    y = (y_max - y_min) / (x_max - x_min) * x + (y_min*x_max - y_max*x_min) / (x_max - x_min)
    return y

This is just a simple linear scaling from the [-1, 1] range of the DDPG algorithm to the action range provided by the environment (e.g. [-3, 22.5]). It works for multiple action dimensions as well. I don't know if that's the way to go, but it worked for me (at least I could cope somehow with the asymmetric action bounds of my environment). If you have any concerns, feel free to comment on my solution.

wil3 commented 6 years ago

Hi @brendenpetersen, I can't seem to find your fork with the beta distribution implementation. I'm having performance issues with clipping the bounds and was hoping to try your approach.

Update: I found Tensorforce has implemented a beta distribution, https://github.com/reinforceio/tensorforce/blob/master/tensorforce/core/distributions/beta.py

dcolley commented 5 years ago

I would expect the env to be robust and not crash in the case of an oob action. E.g. pressing the up button has the same effect as pressing the up button harder - it's just not as efficient...

It's up to the author of the env to decide whether to punish oob actions, or handle/clip them.

For the author of the agent, using the action limits would accelerate training, but it's not a requirement. The agent will [eventually] learn that -5:5 has the same effect as -1:1, and will ignore [-5..-2]:[2..5] as having 'no benefit'.

However, the agent will struggle to learn this if the env crashes.

ghost commented 5 years ago

What is the point of the action space having bounds if they are not respected? I have a custom environment where the action space is a Box(0.0, 1.0, n) space and I want to use the PPO1/2 algorithms. This issue causes my env to crash, as actions outside that range don't make sense. What is the actual output range of the actions in the PPO algorithms, and how should I rescale - e.g. is clipping or transforming with, say, a sigmoid better?

@ghost Did you solve your problem? I have the same issue: my action boundaries are between -500 and 500, and the actions from the network vary between about -3 and 4. Is there anybody who uses a MuJoCo environment with a large range of torque values? I clipped my actions and it is not enough. What is the output action range for PPO? Is there any?

fbbfnc commented 5 years ago

Hey. I have a similar problem. I'm building a custom environment to solve a research problem. This is my observation space, and every action changes these values, increasing or decreasing them:

self.observation_space = Box(low=np.array([0,100,200,300,400,500,600,700...]), high=np.array([1000,20000,3000,4000,5000,6000,7000...]), dtype=np.int64)

I'm using the observation space as a way to track the status of my agent in the environment. I've imported my environment into the RLlib library to use some algorithms. The observation shouldn't be able to go out of bounds, but that was happening, making my algorithm fail. So I've implemented a check in the step() function that skips the action if it would take the observation space out of bounds and returns the same reward as in the previous step. Is there a cleaner way to do this, in your opinion? Maybe a good idea would be to reduce the action_space when a value is at the edge of the domain, but how can I do that?

ghost commented 5 years ago

hi @fbbfnc, what I have done to solve the problem is: I added an action_modifier() function to my env.py file. It takes the action from the network and adjusts it by multiplying by numbers suitable for my environment; after the multiplication, I clip the values according to my boundaries, and that worked for me. My agent is learning with TRPO.
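
Roughly something like the following sketch (the function name matches my description above, but the specific linear scale is just an example):

import numpy as np

def action_modifier(action, action_space):
    # scale the raw network output (roughly in [-1, 1]) to the env's range, then clip
    scaled = action * (action_space.high - action_space.low) / 2.0
    return np.clip(scaled, action_space.low, action_space.high)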

OnedgeLee commented 5 years ago

Does action range clipping in the environment really work well? My custom environment's action range is far from [-1, 1], and when I apply clipping in the environment, the agent hits the clipped value near the [-1, 1] boundary over and over. Because "proba_distribution_from_latent()" has initial parameters "init_scale=1.0, init_bias=0.0", the initial action range is around [-1, 1]. As training progresses, this range has to adaptively move toward a range that covers that of the environment, but it doesn't seem to act like that: because the agent cannot explore due to clipping, no learning occurs (the stddev cannot play its role because of clipping). No exploring, no information gain, no learning. In my opinion, the init_bias of the linear function in "proba_distribution_from_latent()", which infers the mean, should be within the environment's action range. Am I missing some point?

xubo92 commented 5 years ago

Hi @brendenpetersen

I really appreciate your work, but where can I find your implementation of the MLP beta policy? I checked your fork "stable-baseline" but did not find it in the PPO2 or common folders.

denyHell commented 4 years ago

Hi @brendenpetersen Have you verified that sampling from the beta distribution can result in the value 0 or 1? I am using tfp.distributions.Beta to get an instance of the distribution. Have you done any parameter tuning for the beta policy?