ikostrikov / pytorch-a2c-ppo-acktr-gail

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).
MIT License

Suggestion - implement some "tricks" that improve performance #266

Open henrycharlesworth opened 3 years ago

henrycharlesworth commented 3 years ago

Given how popular this repo is (and rightly so), I was thinking it might be a good idea to implement some simple tricks that have been shown to improve performance with on-policy RL algorithms. I'm thinking mostly about this paper: https://arxiv.org/pdf/2006.05990.pdf, where they do a large-scale study of all the little decisions that can make a big difference in performance.

I haven't run extensive experiments, but I've implemented a couple of the things they mention and they do seem to significantly boost performance. In particular, modifying the code so that the advantages are recomputed at every epoch of the update, as they recommend, does seem to improve performance. An even simpler change to the initialisation makes an even bigger difference: for continuous control, initialise the action std so that its initial value is 0.5 in each dimension, and multiply the weights of the output policy layer by 0.01 at the start (sketched below; there are a lot of other things they discuss in that paper too).
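For concreteness, here is a minimal sketch of the initialisation trick, assuming a diagonal-Gaussian policy head with a state-independent log-std. The module and attribute names are illustrative, not this repo's exact API:

```python
import math
import torch
import torch.nn as nn

class DiagGaussianHead(nn.Module):
    # Illustrative Gaussian policy head with the suggested initialisation.
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, action_dim)
        # State-independent log-std, initialised so the action std starts at 0.5.
        self.log_std = nn.Parameter(torch.full((action_dim,), math.log(0.5)))
        # Scale down the final policy layer so initial action means stay near zero.
        with torch.no_grad():
            self.mean.weight.mul_(0.01)
            self.mean.bias.zero_()

    def forward(self, x):
        return torch.distributions.Normal(self.mean(x), self.log_std.exp())
```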

ChenDRAG commented 3 years ago

@henrycharlesworth I have tried a number of the suggestions proposed in the paper you mentioned (my ablation studies suggest some of them are useful and some, for now, are not) and implemented the "recompute advantage" strategy, which is indeed helpful. It is in my MuJoCo benchmark here; check out the details if you are interested.
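For readers landing here, a rough sketch of where "recompute advantage" sits inside the PPO update loop. The surrounding names (`critic`, `observations`, `rewards`, `masks`, `ppo_epochs`) are hypothetical stand-ins for the stored rollout batch, not code from either repo:

```python
import torch

def compute_gae(rewards, values, masks, gamma=0.99, lam=0.95):
    # rewards/masks: [T], values: [T + 1] including bootstrap value;
    # masks are 0 at episode boundaries, 1 otherwise.
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(rewards.shape[0])):
        delta = rewards[t] + gamma * values[t + 1] * masks[t] - values[t]
        gae = delta + gamma * lam * masks[t] * gae
        advantages[t] = gae
    return advantages, advantages + values[:-1]

for epoch in range(ppo_epochs):
    # Recompute advantages with the *current* critic before each epoch,
    # instead of once with the critic from rollout time.
    with torch.no_grad():
        values = critic(observations).squeeze(-1)  # [T + 1]
    advantages, returns = compute_gae(rewards, values, masks)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    ...  # run the usual clipped-PPO minibatch updates against these targets
```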