ghliu / pytorch-ddpg

Implementation of the Deep Deterministic Policy Gradient (DDPG) using PyTorch
Apache License 2.0

Computation of target values with terminal states #11

Closed · abyardim closed this issue 4 years ago

abyardim commented 4 years ago

I have a question about the following line in the training logic:

https://github.com/ghliu/pytorch-ddpg/blob/e9db328ca70ef9daf7ab3d4b44975076ceddf088/ddpg.py#L75

In the computation of the target Q-values, shouldn't the multiplication be done with

(1-to_tensor(terminal_batch.astype(np.float)))

as we would like the next-state Q-values to be zeroed when the state is terminal. In that case, the stored next state may not even belong to the same episode as the current state, so evaluating the target network on it would be invalid.
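For illustration, a minimal sketch of the target computation the question is asking for, with the bootstrap term masked by (1 - done). The function and variable names here (compute_target_q, next_q_values, discount) are illustrative assumptions, not the repo's exact code:

```python
import numpy as np
import torch

def compute_target_q(reward_batch, terminal_batch, next_q_values, discount=0.99):
    """Bellman target with terminal masking: y = r + gamma * (1 - done) * Q'(s', mu'(s'))."""
    reward = torch.as_tensor(reward_batch, dtype=torch.float32)
    done = torch.as_tensor(terminal_batch, dtype=torch.float32)  # 1.0 if the transition ended the episode, else 0.0
    # (1 - done) zeroes the bootstrap term for terminal transitions,
    # so the target reduces to the immediate reward there.
    return reward + discount * (1.0 - done) * next_q_values

# Example: the second transition is terminal, so its target is just the reward.
rewards   = np.array([1.0, 2.0])
terminals = np.array([0.0, 1.0])
next_q    = torch.tensor([5.0, 5.0])
print(compute_target_q(rewards, terminals, next_q, discount=0.9))  # tensor([5.5000, 2.0000])
```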

Apologies if I'm missing something trivial.

abyardim commented 4 years ago

Apparently this issue has already been raised and addressed; the SequentialMemory class already produces the correct values for terminal_batch.
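For context, a minimal sketch of how a replay memory can make the original multiplication correct by storing the flag already inverted, i.e. 0.0 for terminal transitions and 1.0 otherwise. This is only an assumption about what SequentialMemory does; the class below is illustrative, not the repo's implementation:

```python
import numpy as np

class TinyReplayMemory:
    """Illustrative buffer that stores 0.0 for terminal next-states and 1.0 otherwise,
    so training code can multiply the bootstrap term by terminal_batch directly."""
    def __init__(self):
        self.transitions = []

    def append(self, state, action, reward, next_state, done):
        # Invert the done flag at storage time: 0.0 kills the bootstrap for terminal steps.
        self.transitions.append((state, action, reward, next_state, 0.0 if done else 1.0))

    def sample_and_split(self, batch_size, rng=np.random):
        idx = rng.choice(len(self.transitions), batch_size, replace=False)
        states, actions, rewards, next_states, terminals = zip(*(self.transitions[i] for i in idx))
        return (np.array(states), np.array(actions), np.array(rewards),
                np.array(next_states), np.array(terminals))
```

With that convention, discount * terminal_batch * next_q_values already zeroes the bootstrap term for terminal transitions, so no explicit (1 - ...) is needed in the training code.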