jachiam / rl-intro


batch_advs - calculating advantage function #1

Closed HareshKarnan closed 5 years ago

HareshKarnan commented 5 years ago

Hi @jachiam, thanks for making this public! I'm using this code to learn how to implement VPG, and I noticed that you use the reward-to-go as your batch advantage. Shouldn't the advantage be the difference between the reward-to-go at each step and the value at each step?

jachiam commented 5 years ago

For correctness: yes! I used the advantage-based terminology so that the computation graph setup and update step would more closely resemble what you will find in a standard policy optimization codebase. Because the 80-line VPG code doesn't implement value function approximation, it can't do the proper thing (although I recommend you try implementing this as an exercise!).
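
To make the exercise concrete, here is a rough NumPy sketch (not the code from this repo) of what the advantage computation might look like once you add a value baseline. It assumes you have per-step rewards for one trajectory and a learned value estimate `values[t] ≈ V(s_t)` for each state; the function names and the discount factor `gamma` are illustrative choices, not taken from the 80-line script:

```python
import numpy as np

def reward_to_go(rews, gamma=0.99):
    # Discounted reward-to-go: R_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}
    rtgs = np.zeros(len(rews), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rews))):
        running = rews[t] + gamma * running
        rtgs[t] = running
    return rtgs

def advantages(rews, values, gamma=0.99):
    # Advantage with a value baseline: A_t = R_t - V(s_t),
    # where `values` comes from a learned value network evaluated at each state.
    return reward_to_go(rews, gamma) - np.asarray(values, dtype=np.float64)
```

The point is just that subtracting the baseline changes the variance of the gradient estimate, not its expectation, so `batch_advs` can be swapped from plain reward-to-go to this without touching the rest of the update.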

jachiam commented 5 years ago

If you're looking for a better implementation that has all the bells and whistles (including value functions and generalized advantage estimation), I recommend looking into Spinning Up in Deep RL, which is more or less the successor to this tutorial.
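
For a rough idea of what GAE computes (this is only a sketch of the formula from Schulman et al., not the Spinning Up code), assuming `values` holds a value estimate for every state in the trajectory plus one bootstrap value for the state after the last step:

```python
import numpy as np

def gae_advantages(rews, values, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation:
    #   delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    #   A_t     = sum_{l >= 0} (gamma * lam)^l * delta_{t+l}
    rews = np.asarray(rews, dtype=np.float64)          # length T
    values = np.asarray(values, dtype=np.float64)      # length T + 1 (includes bootstrap value)
    deltas = rews + gamma * values[1:] - values[:-1]
    advs = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advs[t] = running
    return advs
```

With `lam=1` this reduces to the reward-to-go-minus-baseline advantage above; smaller `lam` trades variance for bias.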

HareshKarnan commented 5 years ago

I'm reimplementing this in PyTorch to get a grasp of VPG. I'll look into doing the same for the Spinning Up implementation, thanks!