Hi henry-prior,
Indeed, one could spend more effort optimizing the dual, but I would question whether the benefits are worth the added code complexity, especially since, as with the other objectives, the dual function is only approximated on the sampled batch. For this reason it seems appropriate to use SGD/Adam-style optimization strategies. Since you've done the comparison, though, I'd love to see whether the performance varies significantly one way or another!
Having said that, I do agree with you that being slightly more careful with the dual parameters is important, which is why we use a separate optimizer with its own learning rate to ensure the duals are such that the desired constraint is satisfied on average. In fact we track these by, e.g. logging kl_q_rel, which is the relative E-step constraint (kl_mean_rel for the M-step constraint) and should stay close to 1.
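Roughly speaking, that relative value is just the measured KL normalized by its ε target, so values near 1 mean the constraint is satisfied on average. A minimal sketch (variable names here are illustrative, not necessarily what the learner uses internally):

```python
import jax.numpy as jnp

def relative_kl(kl_per_state, epsilon):
    # KL(q || pi_old) averaged over the batch and divided by its
    # constraint threshold. Values near 1.0 mean the bound is active
    # and satisfied on average. (Sketch only; names are illustrative.)
    return jnp.mean(kl_per_state) / epsilon
```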
Abbas may have different opinions though, so I'll ping him in case he wants to add a couple of pennies.
Thanks for the important question! Happy Acming!
Bobak
Hi Bobak,
Thanks for your insight here! We're definitely on the same page about the setup for optimizing the dual parameters; I'm just considering taking a few more gradient steps during each training epoch. After thinking a bit more about this, I realized that in my implementation I optimize the temperature before calculating the weights, which is different from how it's described in "Relative Entropy Regularized Policy Iteration": right after the quote I shared, the authors mention taking those gradient steps after the weight calculation, i.e. the temperature used for the weight calculation should come from the previous step. Maybe not hugely important, but I'll use this approach when making the comparison for Acme.
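To make the ordering concrete, here's a rough sketch of the paper's variant in plain JAX (my own simplification, not Acme's actual E-step; the function names and hyperparameters are illustrative):

```python
import jax
import jax.numpy as jnp

def temperature_dual(log_temperature, q_values, epsilon):
    # Sampled version of the dual g(eta); q_values has shape
    # [num_sampled_actions, batch_size]. Simplified sketch, not Acme's code.
    eta = jnp.exp(log_temperature)  # keep the temperature positive
    logmeanexp = (jax.scipy.special.logsumexp(q_values / eta, axis=0)
                  - jnp.log(q_values.shape[0]))
    return eta * epsilon + eta * jnp.mean(logmeanexp)

def e_step(q_values, log_temperature, epsilon, lr=1e-2, num_eta_steps=5):
    # Paper ordering: the weights use the temperature from the *previous* step...
    weights = jax.nn.softmax(q_values / jnp.exp(log_temperature), axis=0)
    # ...and only then do we take a few gradient steps on eta for this batch.
    grad_fn = jax.grad(temperature_dual)
    for _ in range(num_eta_steps):
        log_temperature = log_temperature - lr * grad_fn(
            log_temperature, q_values, epsilon)
    return weights, log_temperature
```

The only difference from what I had is that the softmax line uses the incoming temperature rather than the one produced by the inner gradient loop.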
Thanks for pointing out the logged KL between the target and non-parametric policies, I'll definitely keep that in mind when experimenting.
Going to kick off some runs and will share comparisons. If there are any envs/tasks you'd particularly like to see let me know. I'll start with Humanoid-Stand.
Henry
Hey @bshahr, following up here with some initial results. Right off the bat, I don't see a real benefit to taking multiple gradient steps on the temperature parameter. A caveat: testing so far has been minimal, and the exact setup and hyperparameters could make a difference. I'll detail other setups I'd like to test at the end of this comment.
Here are some plots. Cartpole results are on the first 20 random seeds with 100,000 environment steps, and humanoid results are only on seed=0 with 2,000,000 environment steps. I'm using my own compute here, so I want to experiment more before running additional humanoid seeds.
First to share my code: https://github.com/deepmind/acme/compare/master...henry-prior:acme:full-optimization-of-temperature
I took a pretty naive approach to start, which minimizes changes to the current training setup: I modify the `dual_optimizer` object by using `optax.masked` and `optax.chain`. This means that the temperature parameters are still updated the same way as before, on the gradients of the losses. I've modified the learning rate for the temperature parameters so the total step size is comparable to what it was previously, but the gradient steps are more fine-grained. This may be too strong a constraint, and it may make more sense to increase the learning rate a bit more.
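In case it's useful, the masking pattern looks roughly like this (a simplified sketch; the parameter names and learning rates below are placeholders rather than the actual Acme config):

```python
import jax.numpy as jnp
import optax

# Illustrative dual-parameter pytree; the real learner stores these
# differently, so treat the leaf names as placeholders.
dual_params = {
    'log_temperature': jnp.zeros(()),
    'log_alpha_mean': jnp.zeros(()),
    'log_alpha_stddev': jnp.zeros(()),
}

def temperature_mask(params):
    # Boolean mask pytree: True only for the temperature leaf.
    return {k: k == 'log_temperature' for k in params}

def non_temperature_mask(params):
    return {k: k != 'log_temperature' for k in params}

# Chain two masked optimizers so the temperature gets its own learning
# rate (and its own Adam state) while the other dual variables keep the
# original one. The learner's update call stays unchanged.
dual_optimizer = optax.chain(
    optax.masked(optax.adam(1e-3), temperature_mask),
    optax.masked(optax.adam(1e-2), non_temperature_mask),
)
dual_opt_state = dual_optimizer.init(dual_params)
```

With `optax.masked`, gradients for the unmasked leaves pass through each transform untouched, so the two Adam instances never interfere with each other.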
What I'd like to try next:
Ok great! Thanks for the confirmation, Henry! This matches my expectations. I'll close the issue now, but feel free to update us all with your future findings.
Hi,
I have a question about the MPO implementation, specifically the temperature parameter used for the importance weights. Based on the derivation in the papers, I've gathered that the temperature should be set to the minimizer of the dual function at each iteration, and given its convexity we could make more of an effort to fully optimize it during each step, e.g. "via a few steps of gradient descent on \eta for each batch", right below the formula for \eta on page 4 of "Relative Entropy Regularized Policy Iteration". This is also mentioned below equation 4 on page 4 of "A Distributional View on Multi-Objective Policy Optimization".
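For reference, the temperature dual in question has (going from my reading of the paper, so the notation may be slightly off) the form

$$
g(\eta) = \eta\,\varepsilon + \eta\, \mathbb{E}_{s \sim \mu}\!\left[\log \mathbb{E}_{a \sim \pi(\cdot \mid s,\, \theta_k)}\!\left[\exp\!\left(\frac{Q(s, a)}{\eta}\right)\right]\right],
$$

which is convex in \eta and in practice is approximated with the sampled states and actions from each batch.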
In the Acme implementation this is absent, so I'm curious whether this is still considered a useful/necessary aspect of the algorithm by researchers at DeepMind. If you'd like, I'm happy to add (possibly optional) functionality for it. I've been messing around with it locally for a bit and can see how it's a bit tricky in the current architecture, but I have something that may work while still being clean and not breaking the current design. In my own implementation of MPO in JAX I use SciPy's SLSQP optimizer on the temperature, which works well, but it may be a bit difficult in Acme given that it requires you to be outside of jitted code and able to pull `DeviceArray` values back to the Python process. In my testing it wasn't any slower than a gradient optimizer, but you do break up the asynchronous dispatch, which could create noticeable bottlenecks. Curious to hear about the decision making here either way. Acme is an amazing project with a great design!
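For concreteness, my SLSQP setup looks roughly like the following (simplified from my own code, with illustrative names; the Q-values have to be pulled back to the host as NumPy arrays first):

```python
import numpy as np
from scipy.optimize import minimize

def solve_temperature(q_values, epsilon, eta_init=1.0):
    """Fully optimizes the temperature dual for one batch via SLSQP.

    q_values: np.ndarray of shape [num_sampled_actions, batch_size],
    already pulled off the device (i.e. outside of jit).
    """
    def dual(eta_vec):
        eta = float(eta_vec[0])
        scaled = q_values / eta
        # Numerically stable log-mean-exp over the sampled actions.
        m = scaled.max(axis=0)
        logmeanexp = m + np.log(np.mean(np.exp(scaled - m), axis=0))
        return eta * epsilon + eta * logmeanexp.mean()

    result = minimize(dual, x0=np.array([eta_init]), method='SLSQP',
                      bounds=[(1e-6, None)])
    return float(result.x[0])
```

The returned \eta is then plugged into the importance-weight calculation for that iteration.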