ikostrikov / pytorch-a2c-ppo-acktr-gail

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).
MIT License

Suggestion - implement some "tricks" that improve performance #266

Open henrycharlesworth opened 3 years ago

henrycharlesworth commented 3 years ago

Given how popular this repo is (and rightly so), I was thinking it might be a good idea to implement some simple tricks that have been shown to improve performance with on-policy RL algorithms. I'm thinking mostly about this paper: https://arxiv.org/pdf/2006.05990.pdf, where they do a large-scale study of all the little decisions that can make a big difference in performance.

I haven't run extensive experiments, but I've implemented a couple of the things they mention and they do seem to significantly boost performance. In particular, modifying the code so that the advantages are recomputed at every epoch of the update, as they recommend, does seem to improve performance. An even simpler change to the initialisation makes an even bigger difference: for continuous control, initialise the action std so that its initial value is 0.5 in each dimension, and multiply the weights of the output policy layer by 0.01 at the start (sketched below; there are a lot of other things they discuss in that paper too).
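For concreteness, here is a minimal sketch of the initialisation trick, assuming a diagonal-Gaussian policy head with a state-independent log-std. The module and attribute names are illustrative, not this repo's exact API:

```python
import math
import torch
import torch.nn as nn

class DiagGaussianHead(nn.Module):
    # Illustrative Gaussian policy head with the suggested initialisation.
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, action_dim)
        # State-independent log-std, initialised so the action std starts at 0.5.
        self.log_std = nn.Parameter(torch.full((action_dim,), math.log(0.5)))
        # Scale down the final policy layer so initial action means stay near zero.
        with torch.no_grad():
            self.mean.weight.mul_(0.01)
            self.mean.bias.zero_()

    def forward(self, x):
        return torch.distributions.Normal(self.mean(x), self.log_std.exp())
```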

ChenDRAG commented 3 years ago

@henrycharlesworth I have tried a number of the suggestions proposed in the paper you mentioned (my ablation studies suggest some of them are useful and some, for now, are not) and implemented the "recompute advantage" strategy, which is indeed helpful. It is in my MuJoCo benchmark here; check out the details if you are interested.
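For readers landing here, a rough sketch of where "recompute advantage" sits inside the PPO update loop. The surrounding names (`critic`, `observations`, `rewards`, `masks`, `ppo_epochs`) are hypothetical stand-ins for the stored rollout batch, not code from either repo:

```python
import torch

def compute_gae(rewards, values, masks, gamma=0.99, lam=0.95):
    # rewards/masks: [T], values: [T + 1] including bootstrap value;
    # masks are 0 at episode boundaries, 1 otherwise.
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(rewards.shape[0])):
        delta = rewards[t] + gamma * values[t + 1] * masks[t] - values[t]
        gae = delta + gamma * lam * masks[t] * gae
        advantages[t] = gae
    return advantages, advantages + values[:-1]

for epoch in range(ppo_epochs):
    # Recompute advantages with the *current* critic before each epoch,
    # instead of once with the critic from rollout time.
    with torch.no_grad():
        values = critic(observations).squeeze(-1)  # [T + 1]
    advantages, returns = compute_gae(rewards, values, masks)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    ...  # run the usual clipped-PPO minibatch updates against these targets
```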