liuzuxin / cvpo-safe-rl

Code for "Constrained Variational Policy Optimization for Safe Reinforcement Learning" (ICML 2022)
GNU General Public License v3.0

About π1 and π2. #6

Closed. Zarzard closed this issue 1 year ago.

Zarzard commented 1 year ago

Hi zuxin, in this line and this line, the mu and the sigma seem to be mismatched? And why don't you maximize the log_prob of only the current policy in loss_p?

liuzuxin commented 1 year ago

Hi @Zarzard , this is an implementation trick from MPO Sec. 4.2.1. It is connected to the CMA-ES method, helps the optimization escape local optima, and encourages exploration. I paste the original explanation below:

This procedure has two advantages: 1) the gradient w.r.t. the parameters of the covariance is now independent of changes in the mean; hence the only way the policy can increase the likelihood of good samples far away from the mean is by stretching along the value landscape. This gives us the ability to grow and shrink the distribution supervised by samples without introducing any extra entropy term to the objective [Abdolmaleki et al., 2015; Tangkaratt et al., 2017] (see also Figures 1 and 2 for an experiment showing this effect). 2) we can set the KL bound for mean and co-variance separately. The latter is especially useful in high-dimensional action spaces, where we want to avoid problems with ill-conditioning of the covariance matrix but want fast learning, enabled by large changes to the mean.

Since CVPO is strongly motivated by MPO, I keep most implementation tricks and procedures the same as in MPO. BTW, I am planning to re-implement the algorithm; I will try removing those tricks and see how it performs.
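
To make the trick concrete, here is a rough sketch of the decoupled weighted-MLE update. The tensor names, shapes, and weighting scheme are illustrative assumptions on my side, not the exact code in this repo:

```python
# Minimal sketch of the decoupled M-step update (illustrative, not the repo's code).
import torch
from torch.distributions import MultivariateNormal

def decoupled_policy_loss(mu_new, cov_new, mu_old, cov_old, actions, weights):
    # mu_*: (B, d), cov_*: (B, d, d), actions: (N, B, d), weights: (N, B)
    pi1 = MultivariateNormal(mu_new, covariance_matrix=cov_old)  # new mean, old covariance
    pi2 = MultivariateNormal(mu_old, covariance_matrix=cov_new)  # old mean, new covariance
    # Summing the two log-probs decouples the gradients: the gradient w.r.t. the new
    # mean only flows through pi1, and the gradient w.r.t. the new covariance only
    # flows through pi2, matching the description in MPO Sec. 4.2.1.
    logp = pi1.log_prob(actions) + pi2.log_prob(actions)  # (N, B)
    # weights: non-parametric E-step weights over sampled actions (assumed given here)
    return -(weights * logp).sum(dim=0).mean()
```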

Zarzard commented 1 year ago

I see. Thanks for the detailed answer! BTW, have you ever tried Normal as the policy distribution for each individual action dimension (rather than the much slower MultivariateNormal)? Do this mismatch trick and the separate KL bounds on the mean and variance still work in that case?

liuzuxin commented 1 year ago

Hi @Zarzard , no, I haven't tried the Normal distribution, but I plan to. I think it will not match the decoupling trick as suggested in MPO, because CMA-ES also uses the full covariance matrix rather than a diagonal variance matrix. I am not sure how it will affect performance, but it is definitely worth trying.

Zarzard commented 1 year ago

Yes, for Normal the decoupling trick must be modified to the corresponding single-variate normal formulas: kl_mu = torch.mean((mean1 - mean2) ** 2 / (2 * std1 ** 2)) and kl_sigma = torch.mean(log_std1 - log_std2 + (std2 ** 2) / (2 * std1 ** 2)). But essentially this should work the same as the MultivariateNormal version?
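
For concreteness, those two terms wrapped as a small helper (variable names are illustrative; the exact KL of the variance term has an extra constant -1/2, which does not affect the gradient):

```python
# Sketch of the decoupled per-dimension KL terms written above (illustrative names).
import torch

def decoupled_kl_diag(mean1, log_std1, mean2, log_std2):
    std1, std2 = log_std1.exp(), log_std2.exp()
    # KL over the means, holding the old std1 fixed
    kl_mu = torch.mean((mean1 - mean2) ** 2 / (2 * std1 ** 2))
    # KL over the variances, holding the mean fixed
    # (the exact KL includes an extra -0.5 constant, omitted here as in the formula above)
    kl_sigma = torch.mean(log_std1 - log_std2 + (std2 ** 2) / (2 * std1 ** 2))
    return kl_mu, kl_sigma
```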

liuzuxin commented 1 year ago

MultivariateNormal models the covariance between action dimensions, while Normal assumes each action dimension is independent of the others, so they are not the same. Modeling the covariance between actions makes sense and may give a slight performance improvement over treating the dimensions independently. But I think just using Normal should definitely work.
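
As a quick check of the difference: with a purely diagonal covariance the two parameterizations give identical log-probabilities, so the extra expressiveness of MultivariateNormal comes only from the off-diagonal terms (the values below are illustrative):

```python
# A factorized Normal treats each action dimension independently; it matches a
# MultivariateNormal only when the covariance matrix is diagonal.
import torch
from torch.distributions import Normal, Independent, MultivariateNormal

mu = torch.zeros(3)
std = torch.tensor([0.5, 1.0, 2.0])

diag_pi = Independent(Normal(mu, std), 1)                        # independent dims
full_pi = MultivariateNormal(mu, covariance_matrix=torch.diag(std ** 2))

a = torch.randn(3)
print(torch.allclose(diag_pi.log_prob(a), full_pi.log_prob(a)))  # True
# With nonzero off-diagonal covariance entries, only MultivariateNormal can
# represent correlations between action dimensions.
```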

Zarzard commented 1 year ago

I see. Thanks for the answers.