skervim opened this issue 5 years ago
Hello,
There is an issue at openai/baselines (here) about the advantages of a beta distribution over a diagonal Gaussian distribution + clipping. The relevant paper: Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution.
Is it possible to add a beta distribution to the repository?
This is not planned, but we are open to PRs. Also, as for the Huber loss (see #95), we would need to run several benchmarks to assess the utility of such a feature before merging it.
Hello,
I'm working on continuous control problems with asymmetric, bounded continuous action spaces. While Gaussian policies offer decent performance, they often take a long time to train and the action distribution often doesn't fully match the problem space. My current workaround is to rescale the action and clip it (by the way, I had to disable the way clipping is currently done so I can apply a custom transformation), but that only helps with matching the environment constraints.
Some real-world continuous control problems would benefit from this; I'm mainly thinking about mechanical engine part control or industrial machine optimization (e.g. calibration).
I'm testing the following (draft) implementation:
import tensorflow as tf

from stable_baselines.common.distributions import ProbabilityDistribution


class BetaProbabilityDistribution(ProbabilityDistribution):
    def __init__(self, flat):
        self.flat = flat
        # alpha, beta > 1 via softplus + 1, as per http://proceedings.mlr.press/v70/chou17a/chou17a.pdf
        # note: number of units = tensor rank - 1 (i.e. 1 for a [batch, features] latent),
        # so this draft emits a single (alpha, beta) pair per sample
        alpha = 1.0 + tf.layers.dense(flat, len(flat.shape) - 1, activation=tf.nn.softplus)
        beta = 1.0 + tf.layers.dense(flat, len(flat.shape) - 1, activation=tf.nn.softplus)
        self.dist = tf.distributions.Beta(concentration1=alpha, concentration0=beta,
                                          validate_args=True, allow_nan_stats=False)

    def flatparam(self):
        return self.flat

    def mode(self):
        return self.dist.mode()

    def neglogp(self, x):
        return tf.reduce_sum(-self.dist.log_prob(x), axis=-1)

    def kl(self, other):
        assert isinstance(other, BetaProbabilityDistribution)
        return self.dist.kl_divergence(other.dist)

    def entropy(self):
        return self.dist.entropy()

    def sample(self):
        return self.dist.sample()

    @classmethod
    def fromflat(cls, flat):
        return cls(flat)
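The companion type class isn't shown above; a minimal, untested sketch could look like this (it assumes the ProbabilityDistributionType interface from stable_baselines.common.distributions; exact method names may differ depending on the version):

from stable_baselines.common.distributions import ProbabilityDistributionType

class BetaProbabilityDistributionType(ProbabilityDistributionType):
    def __init__(self, ac_space):
        # one Beta distribution per action dimension
        self.size = ac_space.shape[0]

    def probability_distribution_class(self):
        return BetaProbabilityDistribution

    def proba_distribution_from_flat(self, flat):
        return self.probability_distribution_class()(flat)

    def param_shape(self):
        # assumes one (alpha, beta) pair per action dimension; the draft above
        # instead derives alpha/beta from the latent, so adjust if needed
        return [2 * self.size]

    def sample_shape(self):
        return [self.size]

    def sample_dtype(self):
        return tf.float32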
For now I've been able to run it with a custom policy like:
pdtype = BetaProbabilityDistributionType(ac_space)
...
obs = self.processed_x
with tf.variable_scope("model"):
    x = obs
    ...
    x = tf.nn.relu(tf.layers.dense(x, 128, name='pi_fc' + str(i), kernel_initializer=U.normc_initializer(1.0)))
    self.policy = tf.layers.dense(x, ac_space.shape[0], name='pi')
    self.proba_distribution = pdtype.proba_distribution_from_flat(x)

    x = obs
    ...
    x = tf.nn.relu(tf.layers.dense(x, 128, name='vf_fc' + str(i), kernel_initializer=U.normc_initializer(1.0)))
    value_fn = tf.layers.dense(x, 1, name='vf')
    self.q_value = tf.layers.dense(value_fn, 1, name='q')
...
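For context, plugging such a custom policy into a model would look roughly like this (BetaMlpPolicy is a hypothetical name for the policy class containing the code above, and env is the target environment):

from stable_baselines import PPO2

# BetaMlpPolicy: hypothetical custom policy class built as sketched above
model = PPO2(BetaMlpPolicy, env, verbose=1)
model.learn(total_timesteps=100000)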
I'm now running it with PPO1 & PPO2 against my benchmark environment (asymmetric, bounded, continuous action space) to see how it compares with the Gaussian. I'm running into trouble with TRPO, but I haven't had time to investigate further.
Note: it still requires rescaling the action from [0, 1] to the environment's action space. This can be done manually, or a custom post-processing mechanism for the action could be added.
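As a sketch of such a post-processing mechanism (RescaleBetaAction is a hypothetical helper, not something in the library), a gym ActionWrapper could do the rescaling outside the agent:

import gym
import numpy as np

class RescaleBetaAction(gym.ActionWrapper):
    """Hypothetical wrapper: the agent acts in [0, 1], the env receives rescaled actions."""
    def __init__(self, env):
        super(RescaleBetaAction, self).__init__(env)
        self.low = env.action_space.low
        self.high = env.action_space.high
        self.action_space = gym.spaces.Box(low=0.0, high=1.0,
                                           shape=env.action_space.shape, dtype=np.float32)

    def action(self, action):
        # map [0, 1] samples to the original action bounds
        return self.low + np.clip(action, 0.0, 1.0) * (self.high - self.low)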
Well, after testing a bit, it doesn't seem to improve overall performance, at least on my environment (I didn't test on classic control tasks). It does seem to converge, but it's slower and the average reward is lower than with the Gaussian policy.
I would say you need some hyperparameter tuning... The parameters present in the current implementation were tuned for Gaussian policies, so it is not completely fair to compare them without tuning.
@araffin I'll try to spend some time on that. Any idea of what hyperparam would be best to try tuning first?
The best practice would be to use hyperband or hyperopt to do it automatically (see https://github.com/araffin/robotics-rl-srl#hyperparameter-search). This script written by @hill-a can get you started.
Otherwise, with PPO, the hyperparameters that are most important in my experience are: n_steps (together with nminibatches), ent_coef (entropy coefficient), and lam (GAE lambda coefficient). Additionally, you can also tune noptepochs, cliprange, and the learning rate.
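For illustration, those knobs map onto the PPO2 constructor like this (the values below are just placeholders showing where each parameter goes, not tuned recommendations; env is assumed to be already created):

from stable_baselines import PPO2

model = PPO2("MlpPolicy", env,
             n_steps=2048,          # rollout length per environment (tune together with nminibatches)
             nminibatches=32,
             ent_coef=0.0,          # entropy coefficient
             lam=0.95,              # GAE lambda
             noptepochs=10,
             cliprange=0.2,
             learning_rate=2.5e-4,
             verbose=1)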
@antoine-galataud can you share your implementation of the beta distribution?
@HareshMiriyala sure, I'll PR that soon.
I don't think it's ready for a PR so here is the branch link: https://github.com/antoine-galataud/stable-baselines/tree/beta-pd
This is based on the TensorFlow Beta implementation and on Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution (notably the idea of using a softplus activation and adding 1 as a constant to alpha and beta).
Usage: I didn't work on configuring which distribution to use in a generic manner, so you have to use it in a custom policy. Ideally there should be a way to choose between Gaussian and Beta in https://github.com/hill-a/stable-baselines/blob/596a5c45d611ab560ae98dfc348383f48ca76966/stable_baselines/common/distributions.py#L467. You can refer to the example above for creating a custom policy that uses it.
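Roughly, that choice could look like this (just a sketch of the idea; the use_beta flag and the bounded-space check are assumptions, not part of the current API):

import numpy as np
from gym import spaces

def make_proba_dist_type(ac_space, use_beta=False):
    if isinstance(ac_space, spaces.Box):
        assert len(ac_space.shape) == 1, "Error: the action space must be a vector"
        bounded = np.all(np.isfinite(ac_space.low)) and np.all(np.isfinite(ac_space.high))
        if use_beta and bounded:
            # the Beta distribution only makes sense for bounded action spaces
            return BetaProbabilityDistributionType(ac_space)
        return DiagGaussianProbabilityDistributionType(ac_space.shape[0])
    # ... other action space types (Discrete, MultiDiscrete, ...) unchanged ...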
@antoine-galataud before submitting a PR, please take a look at the contribution guide https://github.com/hill-a/stable-baselines/pull/148 (that would save time ;)). It will be merged with master soon.
@antoine-galataud Thanks a bunch !
@antoine-galataud How are you handling scaling the sample from the beta distribution's (0, 1) range to the action space bounds?
@HareshMiriyala I do it like this:
def step(self, action):
    action = action * (self.action_space.high - self.action_space.low) + self.action_space.low
    ...
Thanks, where do you make this change? I'm not so familiar with the code; can you guide me to which file and class you make this change in?
@araffin: Could you give an estimate for how long it will take until the beta distribution will be merged with master? Thanks in advance!
@skervim well, I don't know, as I'm not in charge of implementing or testing it. However, that does not mean you cannot test it before (cf. install from source in the doc).
@araffin: Sorry!! I misunderstood your message:
It will be merged with master soon.
@antoine-galataud: I don't know if it helps you, but there is also a beta distribution implemented in Tensorforce (here). Maybe it can serve as a reference? Thank you very much for implementing the beta distribution. I think it will help in a lot of environments and RL problems. :)
@HareshMiriyala the step() function is one that you implement when you write a custom gym env. You can also modify an existing env to see how it goes.
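To make it concrete, a stripped-down custom env would look something like this (MyCustomEnv and its spaces are just placeholders):

import gym
import numpy as np

class MyCustomEnv(gym.Env):
    def __init__(self):
        # the real action bounds of the problem
        self.action_space = gym.spaces.Box(low=-2.0, high=2.0, shape=(1,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)

    def reset(self):
        return np.zeros(self.observation_space.shape, dtype=np.float32)

    def step(self, action):
        # map the [0, 1] Beta sample to the env's action bounds
        action = action * (self.action_space.high - self.action_space.low) + self.action_space.low
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        reward, done, info = 0.0, False, {}
        return obs, reward, done, info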
@skervim I couldn't dedicate time to testing (apart from a quick test with a custom env). I also have to write unit tests. If you have access to continuous control environments (MuJoCo, ...) to give it a try, that would definitely help. Apart from that, I'd like to provide better integration with action value scaling and distribution type selection based on configuration. Maybe later, if we see any benefit with this implementation. That doesn't prevent testing it as is, anyway.
@skervim if you want to test on continuous envs for free (no MuJoCo licence required), I recommend the PyBullet envs (see the rl baselines zoo).
@HareshMiriyala I do it like this:
def step(self, action):
    action = action * (self.action_space.high - self.action_space.low) + self.action_space.low
    ...
@antoine-galataud Is it legitimate/better to perform this operation in the step function of the environment, or is it better to put it in the network (updating the proba_distribution_from_latent function)? In the first case, during training, after a certain number of episodes I observed a drop in the average reward. If I put this as the final network layer, that doesn't happen (although convergence is not as good).
@HareshMiriyala I've seen rescaling performed in various parts of the code, depending on the env, the framework, or the project. In my opinion, it shouldn't impact overall performance, as long as rescaling consistently gives the same output for a given input. Out of curiosity, what type of problem are you applying this to, and how does the performance compare with the Gaussian pd?
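For reference, the "put it in the network" variant could be sketched on top of the class above like this (ScaledBetaProbabilityDistribution is a hypothetical name, not something in the branch; the constant log-Jacobian of the affine map is dropped, which is fine for PPO's probability ratio but shifts the entropy by a constant):

class ScaledBetaProbabilityDistribution(BetaProbabilityDistribution):
    def __init__(self, flat, low, high):
        super(ScaledBetaProbabilityDistribution, self).__init__(flat)
        self.low, self.high = low, high

    def sample(self):
        # samples come out directly in the env's action range
        return self.low + super(ScaledBetaProbabilityDistribution, self).sample() * (self.high - self.low)

    def mode(self):
        return self.low + super(ScaledBetaProbabilityDistribution, self).mode() * (self.high - self.low)

    def neglogp(self, x):
        # map actions back to [0, 1] before evaluating the log-probability
        return super(ScaledBetaProbabilityDistribution, self).neglogp((x - self.low) / (self.high - self.low))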