Hello @philtabor ,
When you try to use experience replay in an actor-critic setting, it looks to me as if only the critic part is actually trained (gradients propagated): the actor part comes from log_probs stored in a numpy array, which cannot backpropagate gradients. However, in my opinion the actual problem is more general. Since the policy is supposed to be evolving, it does not make sense to store the outputs of an older, worse policy. The log_probs need to be recomputed in the learn function, the same way the critic network's outputs are.
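
Here is a minimal sketch of what I mean (discrete actions, PyTorch). All names are just placeholders for whatever the agent class actually uses: `self.actor`, `self.critic`, `self.memory.sample_buffer`, and a single shared optimizer are assumptions, not the repo's actual API. The point is only that the buffer stores raw transitions and the log_probs are recomputed under the *current* policy inside `learn`, so gradients can flow into the actor:

```python
import torch
import torch.nn.functional as F

def learn(self):
    # Sample raw transitions; we store states/actions/rewards,
    # NOT log_probs -- those are recomputed under the current policy.
    states, actions, rewards, next_states, dones = \
        self.memory.sample_buffer(self.batch_size)

    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    next_states = torch.tensor(next_states, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.bool)

    # Critic: TD targets computed without tracking gradients
    values = self.critic(states).squeeze(-1)
    with torch.no_grad():
        next_values = self.critic(next_states).squeeze(-1)
        next_values[dones] = 0.0
        targets = rewards + self.gamma * next_values
    delta = targets - values

    # Actor: recompute log_probs with the CURRENT actor network,
    # instead of reading stale values out of a numpy buffer
    probs = self.actor(states)                 # action probabilities
    dist = torch.distributions.Categorical(probs)
    log_probs = dist.log_prob(actions)

    actor_loss = -(log_probs * delta.detach()).mean()
    critic_loss = F.mse_loss(values, targets)

    self.optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    self.optimizer.step()
```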