I have a question about the following line in the training logic:
https://github.com/ghliu/pytorch-ddpg/blob/e9db328ca70ef9daf7ab3d4b44975076ceddf088/ddpg.py#L75
In the computation of the target Q-values, shouldn't the multiplication be done with

(1 - to_tensor(terminal_batch.astype(np.float)))

since we would like the next-state Q-values to be zeroed if the state was terminal? In that case, the next state might not belong to the same episode as the current state, so evaluating the target network on it would be invalid.
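For concreteness, here is a minimal sketch of the masked Bellman target I have in mind. `compute_target_q` and its argument names are hypothetical stand-ins, not the repo's actual variables, and I'm assuming the batch tensors are shaped `(batch_size, 1)`:

```python
import numpy as np
import torch

def compute_target_q(critic_target, actor_target,
                     reward_batch, next_state_batch, terminal_batch,
                     gamma=0.99):
    """Hypothetical helper illustrating the masked target computation.

    reward_batch and next_state_batch are torch tensors; terminal_batch
    is a (batch_size, 1) numpy array of 0/1 flags, mirroring the
    terminal_batch.astype(...) call in the question.
    """
    with torch.no_grad():
        next_q = critic_target(next_state_batch,
                               actor_target(next_state_batch))
        # 1 for non-terminal transitions, 0 for terminal ones, so the
        # bootstrap term is zeroed exactly when the episode ended here
        # and the terminal state contributes only its immediate reward.
        mask = torch.as_tensor(1.0 - terminal_batch.astype(np.float32))
        target_q = reward_batch + gamma * mask * next_q
    return target_q
```

With this masking, a terminal transition's target reduces to `reward_batch` alone, so the (possibly cross-episode) next state never influences the critic update.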
Apologies if I'm missing something trivial.