Open · gauravjain14 opened this issue 4 years ago
I believe you are right that the calculation of the delta is bugged, and will always yield zero in the current setup. I'll try fixing it over the weekend.
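For context, here is a minimal runnable sketch of the ordering problem being described, using a toy PyTorch model. The names (policy, compute_loss, data) are hypothetical stand-ins and are not taken from the repository's actual code:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the policy and its loss; hypothetical, for illustration only.
policy = nn.Linear(4, 2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
data = torch.randn(8, 4)

def compute_loss(batch):
    # Placeholder "policy loss": mean squared output, just to get a scalar.
    return policy(batch).pow(2).mean()

# Buggy ordering: both losses are evaluated under the SAME parameters,
# because the gradient step only happens afterwards.
loss_old = compute_loss(data)
loss_new = compute_loss(data)

optimizer.zero_grad()
loss_new.backward()
optimizer.step()

delta = loss_new.item() - loss_old.item()
print(delta)  # always 0.0 -- the parameters never changed between the two evaluations
```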
Thanks!
I am not sure if you want a PR for this. I could open one as well if the order I have mentioned seems right to you.
No need, I'll get it figured out.
Here the comment says to train the policy with a single step of gradient descent and then calculate the delta (the difference between the old and new loss).
However, from the code, it seems that the backward pass and step() are performed only after both the old and the new loss values have been computed, so both values come from the same parameters. Shouldn't the order instead be: compute the old loss, perform the gradient step, and then compute the new loss (as sketched below)?
Or is there something wrong in my fundamental understanding?
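For illustration, a sketch of the ordering the question seems to be asking about, reusing the hypothetical policy, optimizer, compute_loss, and data from the sketch above; this is not the repository's code, just the presumed intended sequence:

```python
# Presumed corrected ordering: evaluate, update, then re-evaluate.
loss_old = compute_loss(data)      # loss under the current parameters

optimizer.zero_grad()
loss_old.backward()
optimizer.step()                   # single step of gradient descent

with torch.no_grad():
    loss_new = compute_loss(data)  # loss after the update, on the same batch

delta = loss_new.item() - loss_old.item()
print(delta)  # generally nonzero, since the parameters changed in between
```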