Hi @Zarzard , this is also an implementation trick from MPO Sec. 4.2.1; it is connected to the CMA-ES method, helps the policy jump out of local optima, and encourages exploration. I paste the original explanation below:
> This procedure has two advantages: 1) the gradient w.r.t. the parameters of the covariance is now independent of changes in the mean; hence the only way the policy can increase the likelihood of good samples far away from the mean is by stretching along the value landscape. This gives us the ability to grow and shrink the distribution supervised by samples without introducing any extra entropy term to the objective [Abdolmaleki et al., 2015; Tangkaratt et al., 2017] (see also Figures 1 and 2 for an experiment showing this effect). 2) we can set the KL bound for mean and covariance separately. The latter is especially useful in high-dimensional action spaces, where we want to avoid problems with ill-conditioning of the covariance matrix but want fast learning, enabled by large changes to the mean.
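For concreteness, here is a minimal sketch of what the decoupled KL terms look like for full-covariance Gaussians (illustrative only, not the actual CVPO code; the `decoupled_kl` name and the single-state, unbatched shapes are my assumptions):

```python
import torch

def decoupled_kl(mu_old, Sigma_old, mu_new, Sigma_new):
    """Decoupled KL(old || new) between full-covariance Gaussians.

    The mean term holds the covariance fixed at the old one, and the
    covariance term holds the mean fixed, so each can get its own bound.
    """
    d = mu_old.shape[-1]
    diff = (mu_new - mu_old).unsqueeze(-1)  # shape (d, 1)
    # Mean part: 0.5 * (mu_new - mu_old)^T Sigma_old^{-1} (mu_new - mu_old)
    kl_mu = 0.5 * (diff.transpose(-1, -2) @ torch.inverse(Sigma_old) @ diff).squeeze()
    # Covariance part: 0.5 * (tr(Sigma_new^{-1} Sigma_old) - d + log(|Sigma_new| / |Sigma_old|))
    kl_sigma = 0.5 * (torch.trace(torch.inverse(Sigma_new) @ Sigma_old) - d
                      + torch.logdet(Sigma_new) - torch.logdet(Sigma_old))
    return kl_mu, kl_sigma
```

Each term can then be bounded separately, e.g. `kl_mu <= eps_mu` and `kl_sigma <= eps_sigma`, which is what allows large mean updates while keeping the covariance update conservative.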
Since CVPO is strongly motivated by MPO, I kept most of the implementation tricks and procedures from MPO. BTW, I am planning to re-implement the algorithm and will try removing those tricks to see how it performs.
I see. Thanks for the detailed answer!
BTW, have you ever tried `Normal` as the policy distribution for each individual action dimension (rather than the much slower `MultivariateNormal`)? Do this mismatch trick and the separate KL bounds for mean and variance still work in that case?
Hi @Zarzard , no, I haven't tried the `Normal` distribution, but I plan to. I think it will conflict with the decoupling trick as suggested in MPO, because CMA-ES also uses the full covariance matrix rather than a diagonal variance matrix. I am not sure how it will influence the performance, but it is definitely worth trying.
Yes, for `Normal` the decoupling trick must be modified to the corresponding formulas for the univariate normal distribution (here index 1 denotes the old policy and index 2 the new one):

```python
kl_mu = torch.mean((mean1 - mean2) ** 2 / (2 * std1 ** 2))
```

and

```python
kl_sigma = torch.mean(log_std2 - log_std1 + std1 ** 2 / (2 * std2 ** 2) - 0.5)
```

But essentially, can this work the same as the one for `MultivariateNormal`?
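As a quick numerical sanity check (made-up numbers; assuming the convention above that index 1 is the old policy and index 2 the new one), each decoupled term matches PyTorch's exact `kl_divergence` when the other parameter set is frozen at the old values:

```python
import torch
from torch.distributions import Normal, kl_divergence

# Hypothetical old (1) and new (2) diagonal-Gaussian policy parameters.
mean1, log_std1 = torch.tensor([0.2, -0.5]), torch.tensor([-0.1, 0.3])
mean2, log_std2 = torch.tensor([0.4, -0.3]), torch.tensor([0.0, 0.1])
std1, std2 = log_std1.exp(), log_std2.exp()

kl_mu = torch.mean((mean1 - mean2) ** 2 / (2 * std1 ** 2))
kl_sigma = torch.mean(log_std2 - log_std1 + std1 ** 2 / (2 * std2 ** 2) - 0.5)

# Mean term == exact KL with the std held fixed at the old value.
assert torch.allclose(kl_mu, kl_divergence(Normal(mean1, std1), Normal(mean2, std1)).mean())
# Sigma term == exact KL with the mean held fixed at the old value.
assert torch.allclose(kl_sigma, kl_divergence(Normal(mean1, std1), Normal(mean1, std2)).mean())
```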
`MultivariateNormal` considers the covariance between action dimensions, while `Normal` assumes each action dimension is independent of the others, so they are not the same. Incorporating the covariance between actions makes sense and may yield a slight performance improvement over treating them independently. But I think just using `Normal` should definitely work.
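To make the difference concrete, here is a small comparison (illustrative values only): a factorized `Normal` is exactly a `MultivariateNormal` with diagonal covariance, but only a full covariance can model correlation between action dimensions:

```python
import torch
from torch.distributions import Normal, MultivariateNormal

mean = torch.tensor([0.1, -0.2])
std = torch.tensor([0.5, 1.5])
action = torch.tensor([0.3, 0.4])

# Factorized Normal: sum per-dimension log-probs (independence assumption).
logp_normal = Normal(mean, std).log_prob(action).sum()

# MultivariateNormal with a *diagonal* covariance is exactly the same model.
logp_diag = MultivariateNormal(mean, covariance_matrix=torch.diag(std ** 2)).log_prob(action)
assert torch.allclose(logp_normal, logp_diag)

# A full covariance can additionally express correlation between action
# dimensions, which the factorized Normal cannot.
cov_full = torch.tensor([[0.25, 0.3], [0.3, 2.25]])
logp_full = MultivariateNormal(mean, covariance_matrix=cov_full).log_prob(action)
```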
I see. Thanks for the answers.
Hi zuxin, in this line and this line, the mu and the sigma seem to be mismatched? And why don't you maximize the `log_prob` of only the current policy in `loss_p`?