I have a question about the following line in the training logic:
https://github.com/ghliu/pytorch-ddpg/blob/e9db328ca70ef9daf7ab3d4b44975076ceddf088/ddpg.py#L75
In the computation of the target Q-values, shouldn't the multiplication be done with

(1 - to_tensor(terminal_batch.astype(np.float)))

since we would like the next-state Q-values to be zeroed if the state was terminal? In that case, the next state might not belong to the same episode as the current state, so evaluating the target network on it would be invalid.
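For concreteness, here is a minimal sketch of the masked Bellman target I have in mind. `compute_target_q` and its argument names are hypothetical stand-ins, not the repo's actual variables, and I'm assuming the batch tensors are shaped `(batch_size, 1)`:

```python
import numpy as np
import torch

def compute_target_q(critic_target, actor_target,
                     reward_batch, next_state_batch, terminal_batch,
                     gamma=0.99):
    """Hypothetical helper illustrating the masked target computation.

    reward_batch and next_state_batch are torch tensors; terminal_batch
    is a (batch_size, 1) numpy array of 0/1 flags, mirroring the
    terminal_batch.astype(...) call in the question.
    """
    with torch.no_grad():
        next_q = critic_target(next_state_batch,
                               actor_target(next_state_batch))
        # 1 for non-terminal transitions, 0 for terminal ones, so the
        # bootstrap term is zeroed exactly when the episode ended here
        # and the terminal state contributes only its immediate reward.
        mask = torch.as_tensor(1.0 - terminal_batch.astype(np.float32))
        target_q = reward_batch + gamma * mask * next_q
    return target_q
```

With this masking, a terminal transition's target reduces to `reward_batch` alone, so the (possibly cross-episode) next state never influences the critic update.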
Apologies if I'm missing something trivial.