philtabor / Youtube-Code-Repository

Repository for most of the code from my YouTube channel

A2C with experience replay #19

Open aivanni opened 4 years ago

aivanni commented 4 years ago

Hello @philtabor ,

When you use experience replay in the actor-critic setting, it looks to me like only the critic is actually trained (gradients propagate through it). The actor term is built from log_probs stored in a NumPy array, so no gradients can propagate back to the actor. However, IMHO the actual problem is more general: since the policy is supposed to keep evolving, it does not make sense to reuse the stored outputs of an older, worse policy. The log_probs need to be recomputed in the learning function, the same way the critic network's outputs are.
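
A minimal sketch of what I mean, assuming PyTorch and a discrete action space. The `ActorCritic` class, the `learn` signature, and the random batch are hypothetical illustrations, not the code from this repository. The replay buffer would store only states, actions, rewards, next states, and done flags; the log-probabilities are rebuilt from the current network inside `learn`, so the actor loss stays attached to the graph.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

torch.manual_seed(0)


class ActorCritic(nn.Module):
    """Hypothetical shared-body network with a policy head and a value head."""

    def __init__(self, n_inputs=4, n_actions=2, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_inputs, hidden), nn.ReLU())
        self.pi = nn.Linear(hidden, n_actions)  # action logits
        self.v = nn.Linear(hidden, 1)           # state value

    def forward(self, state):
        x = self.body(state)
        return self.pi(x), self.v(x).squeeze(-1)


def learn(net, optimizer, states, actions, rewards, next_states, dones, gamma=0.99):
    """One update on a sampled batch; log-probs are recomputed, not read from memory."""
    logits, values = net(states)

    # Bootstrapped TD target; no gradient flows through the target.
    with torch.no_grad():
        _, next_values = net(next_states)
        targets = rewards + gamma * next_values * (1.0 - dones)

    # The advantage is treated as a constant in the actor loss.
    advantages = targets - values.detach()

    # Recompute log pi(a|s) with the CURRENT policy for the stored actions.
    log_probs = Categorical(logits=logits).log_prob(actions)

    actor_loss = -(log_probs * advantages).mean()
    critic_loss = (targets - values).pow(2).mean()

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
    return actor_loss.item(), critic_loss.item()


# Stand-in for a batch sampled from the replay buffer (random data, illustration only).
net = ActorCritic()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
states = torch.randn(8, 4)
actions = torch.randint(0, 2, (8,))
rewards = torch.randn(8)
next_states = torch.randn(8, 4)
dones = torch.zeros(8)

learn(net, optimizer, states, actions, rewards, next_states, dones)
# The policy head now has a gradient, which it would not have with stored log_probs.
print(net.pi.weight.grad.abs().sum() > 0)
```

This only addresses the gradient flow. It applies no off-policy correction, so the replayed actions are still treated as if the current policy had chosen them.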