mrsamsami opened 1 year ago
I could be completely wrong, but is it possible that the assert is hard-coded for the case when `actor_grad == "reinforce"`?
Same problem here!
You are right, it is very likely that the assertion was written assuming `actor_grad=reinforce`. What if you simply remove it, does it work then?
To be honest, I did way less testing with `actor_grad=dynamics`. The functionality did work at one point and was tested with DMC, but something could have changed since then.
Yes, ignoring it works. However, I'm still a little confused about how the dynamics back-prop works given the non-differentiable value target. Leaving the entropy loss aside, can you clarify how, in the code, the actor's parameters are updated?
If you use the reinforce policy gradient then you don't back-prop through the dynamics anymore.
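For context, here is a minimal sketch (my own pseudocode, not the repo's actual implementation) of how the two `actor_grad` modes typically differ in a DreamerV2-style actor loss. The names `logprob`, `value_target`, and `baseline` are assumptions:

```python
import torch

def actor_loss(actor_grad: str,
               logprob: torch.Tensor,
               value_target: torch.Tensor,
               baseline: torch.Tensor) -> torch.Tensor:
    if actor_grad == 'reinforce':
        # REINFORCE: the gradient flows through the log-probability of the
        # sampled actions; the advantage is treated as a constant.
        advantage = (value_target - baseline).detach()
        return -(logprob * advantage).mean()
    elif actor_grad == 'dynamics':
        # Dynamics backprop: differentiate the imagined return directly
        # through the world model's rollout. This only carries gradients
        # w.r.t. the actor if value_target was NOT detached upstream.
        return -value_target.mean()
    raise ValueError(actor_grad)
```

Under `reinforce`, `value_target` can safely be detached because the gradient signal enters via `logprob`; under `dynamics`, the value target itself must stay on the computation graph.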
When running the code on DMC, `actor_grad` is `dynamics`, so `loss_policy` would be `-value_target`. Since `value_target` does not depend on the actor's policy distribution, no gradient flows through `loss_policy` with respect to the actor's parameters. The assertion then evaluates to `assert (False and True) or not True`, since `loss_policy` does not require gradients, which makes the whole expression `False`. How can we fix it?
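To make the failure mode concrete, here is a hypothetical sketch, assuming `value_target` is produced from detached tensors (e.g. under `no_grad`); the guard at the end is one possible fix I'm imagining, not the repo's actual code:

```python
import torch

# If the value target was computed without gradients, the dynamics-mode
# policy loss carries no gradient at all:
value_target = torch.randn(16).detach()
loss_policy = -value_target.mean()
print(loss_policy.requires_grad)  # False -> a REINFORCE-era assertion trips

# Hypothetical guard (not the repo's actual code): only require gradients
# through loss_policy when the REINFORCE estimator is in use.
actor_grad = 'dynamics'
if actor_grad == 'reinforce':
    assert loss_policy.requires_grad
```

Note, though, that if `loss_policy` genuinely has no gradient under `dynamics`, removing or gating the assertion only silences the symptom: the actor would then receive no update from the value term at all, which suggests the value target is being detached somewhere it shouldn't be.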