The assert seems to be hard-coded for the case when self.actor_grad == reinforce, i.e. (P || !C): if gradients flow from the critic loss, then the actor params must be updated using grads from both policy losses. This doesn't cover the case when self.actor_grad == dynamics, where gradients instead flow through the dynamics estimate.
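For illustration only, a minimal sketch of what a mode-aware check could look like. The function name and the boolean flags (has_policy_grad, has_critic_grad, has_dynamics_grad) are hypothetical stand-ins for whatever the surrounding code actually computes, not identifiers from this codebase:

```python
def check_actor_grad_flow(actor_grad: str,
                          has_policy_grad: bool,
                          has_critic_grad: bool,
                          has_dynamics_grad: bool) -> None:
    """Hypothetical mode-aware replacement for the hard-coded assert."""
    if actor_grad == "reinforce":
        # (P || !C): if the critic loss contributed gradients to the actor,
        # the policy loss must have contributed gradients as well.
        assert has_policy_grad or not has_critic_grad
    elif actor_grad == "dynamics":
        # Here actor gradients are expected to flow back through the
        # dynamics estimate rather than the REINFORCE-style policy loss.
        assert has_dynamics_grad
    else:
        raise ValueError(f"unknown actor_grad mode: {actor_grad}")
```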