hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

Beta distribution as policy for environments with bounded continuous action spaces [feature request] #112

Open skervim opened 5 years ago

skervim commented 5 years ago

There is an issue in OpenAI Baselines (here) about the advantages of a beta distribution over a diagonal Gaussian distribution + clipping. The relevant paper: Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution. Is it possible to add a beta distribution to the repository?
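
For intuition, here is a small self-contained NumPy sketch of the boundary-bias argument from that paper (the means and scales below are arbitrary, purely illustrative):

import numpy as np

rng = np.random.RandomState(0)

# A diagonal Gaussian policy sample clipped to action bounds [-1, 1]:
# a large share of the probability mass piles up exactly on the bounds.
gauss = np.clip(rng.normal(loc=0.8, scale=0.5, size=100000), -1.0, 1.0)
print("clipped Gaussian samples stuck at a bound:",
      np.mean((gauss == -1.0) | (gauss == 1.0)))

# A Beta(alpha, beta) sample has support (0, 1); an affine rescale to [-1, 1]
# never needs clipping, so no mass accumulates at the bounds.
beta = 2.0 * rng.beta(5.0, 2.0, size=100000) - 1.0
print("rescaled Beta samples at a bound:",
      np.mean((beta == -1.0) | (beta == 1.0)))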

araffin commented 5 years ago

Hello,

Is it possible to add a beta distribution to the repository?

This is not planned, but we are open to PRs. Also, as for the Huber loss (see #95), we would need to run several benchmarks to assess the utility of such a feature before merging it.

antoine-galataud commented 5 years ago

Hello,

I'm working on continuous control problems with asymmetric, bounded continuous action spaces. While Gaussian policies offer decent performance, they often take a long time to train and the action distribution often does not fully match the problem space. My current workaround is to rescale the action and clip it (by the way, I had to disable the way clipping is currently done so I can apply a custom transformation). But it only helps with matching environment constraints.

Some real-world continuous control problems would benefit from this. I'm mainly thinking about mechanical engine part control or industrial machine optimization (e.g. calibration).

antoine-galataud commented 5 years ago

I'm testing the following (draft) implementation:

class BetaProbabilityDistribution(ProbabilityDistribution):
    def __init__(self, flat):
        self.flat = flat
        # as per http://proceedings.mlr.press/v70/chou17a/chou17a.pdf:
        # softplus keeps the concentrations positive, and the +1 offset keeps
        # alpha, beta >= 1 so the Beta density stays unimodal.
        # (note: for a 2-D (batch, features) input, len(flat.shape)-1 == 1,
        # i.e. this currently produces a single action dimension)
        alpha = 1.0 + tf.layers.dense(flat, len(flat.shape) - 1, activation=tf.nn.softplus)
        beta = 1.0 + tf.layers.dense(flat, len(flat.shape) - 1, activation=tf.nn.softplus)
        self.dist = tf.distributions.Beta(concentration1=alpha, concentration0=beta,
                                          validate_args=True, allow_nan_stats=False)

    def flatparam(self):
        return self.flat

    def mode(self):
        return self.dist.mode()

    def neglogp(self, x):
        return tf.reduce_sum(-self.dist.log_prob(x), axis=-1)

    def kl(self, other):
        assert isinstance(other, BetaProbabilityDistribution)
        return self.dist.kl_divergence(other.dist)

    def entropy(self):
        return self.dist.entropy()

    def sample(self):
        return self.dist.sample()

    @classmethod
    def fromflat(cls, flat):
        return cls(flat)

For now I've been able to run it with a custom policy like:

        pdtype = BetaProbabilityDistributionType(ac_space)
        ...

        obs = self.processed_x

        with tf.variable_scope("model"):
            x = obs
            ...
            x = tf.nn.relu(tf.layers.dense(x, 128, name='pi_fc'+str(i), kernel_initializer=U.normc_initializer(1.0)))
            self.policy = tf.layers.dense(x, ac_space.shape[0], name='pi')
            self.proba_distribution = pdtype.proba_distribution_from_flat(x)

            x = obs
            ...
            x = tf.nn.relu(tf.layers.dense(x, 128, name='vf_fc' + str(i), kernel_initializer=U.normc_initializer(1.0)))
            value_fn = tf.layers.dense(x, 1, name='vf')
            self.q_value = tf.layers.dense(value_fn, 1, name='q')

...            

I'm now running it with PPO1 & PPO2 against my benchmark environment (asymmetric, bounded, continuous action space) to see how it compares with Gaussian. I'm running into trouble with TRPO, but I didn't have time to investigate further.

Note: it still requires rescaling the action from [0, 1] to the environment's action space. This can be done manually, or a custom post-processing mechanism for the action could be added.

antoine-galataud commented 5 years ago

Well, after testing a bit, it doesn't seem to improve overall performance, at least on my environment (I didn't test on classic control tasks). It does seem to converge, but it's slower and the average reward is lower than with the Gaussian policy.

araffin commented 5 years ago

I would say you need some hyperparameter tuning... The parameters in the current implementation were tuned for Gaussian policies, so it is not completely fair to compare them without tuning.

antoine-galataud commented 5 years ago

@araffin I'll try to spend some time on that. Any idea which hyperparameters would be best to tune first?

araffin commented 5 years ago

The best practice would be to use hyperband or hyperopt to do it automatically (see https://github.com/araffin/robotics-rl-srl#hyperparameter-search). This script written by @hill-a can get you started.

Otherwise, with PPO, the hyperparameters that matter most in my experience are: n_steps (together with nminibatches), ent_coef (entropy coefficient), and lam (GAE lambda coefficient). Additionally, you can also tune noptepochs, cliprange and the learning rate.
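
For reference, a minimal PPO2 sketch showing where those hyperparameters are passed (the values and the Pendulum-v0 env are placeholders, not tuned suggestions; a custom policy such as the Beta one above would replace MlpPolicy):

from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

# Placeholder values: start from these names, then tune with hyperband/hyperopt.
model = PPO2(MlpPolicy, "Pendulum-v0",
             n_steps=2048, nminibatches=32,   # rollout length / minibatch split
             ent_coef=0.0, lam=0.95,          # entropy bonus, GAE lambda
             noptepochs=10, cliprange=0.2,    # epochs per update, PPO clip range
             learning_rate=2.5e-4, verbose=1)
model.learn(total_timesteps=100000)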

HareshKarnan commented 5 years ago

@antoine-galataud can you share your implementation of the beta distribution ?

antoine-galataud commented 5 years ago

@HareshMiriyala sure, I'll PR that soon.

antoine-galataud commented 5 years ago

I don't think it's ready for a PR so here is the branch link: https://github.com/antoine-galataud/stable-baselines/tree/beta-pd

This is based on the TensorFlow Beta implementation and Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution (notably the idea of using a softplus activation and adding 1 as a constant to alpha and beta, which keeps both concentrations >= 1 so the density stays unimodal).

Usage: I didn't work on configuring which distribution to use in a generic manner, so you have to use it in a custom policy. Ideally there should be a way to choose between Gaussian and Beta in https://github.com/hill-a/stable-baselines/blob/596a5c45d611ab560ae98dfc348383f48ca76966/stable_baselines/common/distributions.py#L467. You can refer to the example above about creating a custom policy that uses it.
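
For illustration, a hypothetical sketch of what such a switch could look like (the use_beta flag and the wrapper function below do not exist in stable-baselines; BetaProbabilityDistributionType only exists on the beta-pd branch above):

from gym import spaces
from stable_baselines.common.distributions import make_proba_dist_type
# only available on the beta-pd branch, not in stable-baselines master:
# from stable_baselines.common.distributions import BetaProbabilityDistributionType

def make_proba_dist_type_with_beta(ac_space, use_beta=False):
    # Hypothetical dispatch: pick the Beta distribution for 1-D Box spaces
    # when requested, otherwise fall back to the existing behaviour
    # (diagonal Gaussian for Box, categorical for Discrete, ...).
    if use_beta and isinstance(ac_space, spaces.Box):
        assert len(ac_space.shape) == 1, "the Beta policy only supports vector action spaces"
        return BetaProbabilityDistributionType(ac_space)
    return make_proba_dist_type(ac_space)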

araffin commented 5 years ago

@antoine-galataud before submitting a PR, please look at the contribution guide https://github.com/hill-a/stable-baselines/pull/148 (that would save time ;)). It will be merged with master soon.

HareshKarnan commented 5 years ago

@antoine-galataud Thanks a bunch !

HareshKarnan commented 5 years ago

@antoine-galataud How are you handling scaling the sample from the beta distribution ([0, 1]) to the action space bounds?

antoine-galataud commented 5 years ago

@HareshMiriyala I do it like this:

def step(self, action):
  action = action * (self.action_space.high - self.action_space.low) + self.action_space.low
  ...

HareshKarnan commented 5 years ago

Thanks, where do you make this change? I'm not so familiar with the code; can you guide me to which file and class to make this change in?

skervim commented 5 years ago

@araffin: Could you give an estimate for how long it will take until the beta distribution will be merged with master? Thanks in advance!

araffin commented 5 years ago

@skervim well, I don't know, as I'm not in charge of implementing or testing it. However, that does not mean you cannot test it beforehand (cf. install from source in the docs).

skervim commented 5 years ago

@araffin: Sorry!! I misunderstood your message:

It will be merged with master soon.

@antoine-galataud: I don't know if it helps you, but there is also a beta distribution implemented in Tensorforce (here). Maybe it can serve as a reference? Thank you very much for implementing the beta distribution. I think it will help in a lot of environments and RL problems. :)

antoine-galataud commented 5 years ago

@HareshMiriyala the step() function is one that you implement when you write a custom Gym env. You can also modify an existing env to see how it goes.
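
As a rough sketch (not part of the beta-pd branch above; the wrapper name is made up), the same rescaling can also live in a gym.ActionWrapper instead of the env's own step(), which avoids touching the env code:

import gym
import numpy as np

class BetaToBoxWrapper(gym.ActionWrapper):
    """Agent outputs Beta samples in [0, 1]; the wrapped env's step()
    receives actions rescaled to its original Box bounds."""

    def __init__(self, env):
        super().__init__(env)
        self.orig_low = env.action_space.low
        self.orig_high = env.action_space.high
        # Expose a [0, 1] action space to the agent.
        self.action_space = gym.spaces.Box(
            low=0.0, high=1.0, shape=env.action_space.shape, dtype=np.float32)

    def action(self, action):
        # affine map [0, 1] -> [low, high]
        return self.orig_low + (self.orig_high - self.orig_low) * np.asarray(action)

    def reverse_action(self, action):
        return (np.asarray(action) - self.orig_low) / (self.orig_high - self.orig_low)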

@skervim I couldn't dedicate time to testing (apart from a quick one with a custom env). I also have to write unit tests. If you have access to continuous control environments (MuJoCo, ...) to give it a try, that would definitely help. Apart from that, I'd like to provide better integration with action value scaling and distribution type selection based on configuration. Maybe later, if we see any benefit with this implementation. That doesn't prevent testing it as is anyway.

araffin commented 5 years ago

@skervim if you want to test on continuous envs for free (no MuJoCo licence required), I recommend the PyBullet envs (see the RL Baselines Zoo).
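
For example, a minimal sketch assuming pybullet is installed (pip install pybullet); any registered *BulletEnv-v0 id works the same way:

import gym
import pybullet_envs  # noqa: F401  (importing registers the Bullet envs with gym)

from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

env = gym.make("HalfCheetahBulletEnv-v0")  # continuous, bounded action space
model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)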

AGPX commented 4 years ago

@HareshMiriyala I do it like this:

def step(self, action):
  action = action * (self.action_space.high - self.action_space.low) + self.action_space.low
  ...

@antoine-galataud Is it legitimate/better to perform this operation in the step function of the environment, or is it better to put it in the network (updating the proba_distribution_from_latent function)? In the first case, during training, after a certain number of episodes I experienced a drop in the average reward. If I put this as the final network layer, this doesn't happen (although the convergence is not as good).

antoine-galataud commented 4 years ago

@HareshMiriyala I've seen rescaling performed in various parts of the code, depending on the env, the framework or the project. In my opinion, it shouldn't impact overall performance, as long as rescaling consistently gives the same output for a given input. Out of curiosity, on what type of problem are you applying this, and how is the performance compared to the Gaussian pd?