HareshKarnan closed this issue 5 years ago.
For correctness: yes! I used the advantage-based terminology so that the computation graph setup and update step would more closely resemble what you will find in a standard policy optimization codebase. Because the 80-line VPG code doesn't implement value function approximation, it can't do the proper thing (although I recommend you try implementing this as an exercise!).
If you're looking for a better implementation that has all the bells and whistles (including value functions and generalized advantage estimation), I recommend looking into Spinning Up in Deep RL, which is more or less the successor to this tutorial.
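For anyone attempting that exercise, here is a rough, hypothetical sketch (not the tutorial's or Spinning Up's code, and the network sizes and names are my own) of what the value-function baseline looks like in PyTorch: fit V(s) by regression onto the reward-to-go, and use R_t - V(s_t) as the advantage estimate instead of R_t alone.

```python
# Hypothetical sketch: reward-to-go minus a learned value baseline.
import torch
import torch.nn as nn

def reward_to_go(rews):
    # rews: list of per-step rewards for one episode
    rtgs = [0.0] * len(rews)
    running = 0.0
    for i in reversed(range(len(rews))):
        running = rews[i] + running
        rtgs[i] = running
    return torch.as_tensor(rtgs, dtype=torch.float32)

obs_dim = 4  # e.g. CartPole observation size (assumption)
value_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))

def compute_advantages(obs, rews):
    # obs: (T, obs_dim) tensor of observations for one episode
    rtgs = reward_to_go(rews)                 # R_t, the reward-to-go
    values = value_net(obs).squeeze(-1)       # V(s_t), the learned baseline
    adv = (rtgs - values).detach()            # A_t ~ R_t - V(s_t), no grad into policy loss
    value_loss = ((values - rtgs) ** 2).mean()  # regress V(s_t) onto reward-to-go
    return adv, value_loss

# Usage with a fake episode of length 5:
obs = torch.randn(5, obs_dim)
rews = [1.0, 0.0, 1.0, 1.0, 0.0]
adv, v_loss = compute_advantages(obs, rews)
```

The value loss would be minimized with its own optimizer step alongside the policy update; normalizing `adv` per batch is a common extra trick.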
I'm reimplementing this in PyTorch to get a grasp of VPG. Will look into doing the same for the SpinningUp implementation, thanks!
Hi @jachiam, thanks for making this public! I'm using this code to learn how to implement VPG, and I noticed that you use the reward-to-go as your batch advantage. Shouldn't the advantage be the difference between the reward-to-go at each step and the value at each step?
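For reference, the estimator being asked about here, written out in my own notation (a value baseline $V_\phi$ subtracted from the reward-to-go), is

$$\hat{A}_t = \hat{R}_t - V_\phi(s_t), \qquad \hat{R}_t = \sum_{t'=t}^{T} r_{t'},$$

whereas the 80-line code effectively uses $\hat{A}_t = \hat{R}_t$ on its own.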