In the DQN loss, the update should only happen for the observed action. Assume a mini-batch of (s, a, r, s2) with m samples (so s and s2 are m x n, where n is the number of features, and so on). Then, for sample j in the mini-batch, only a[j] should contribute to the loss, which means that all output elements of the Q-network (i.e., target[j][:]) except target[j][a[j]] should be masked.
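Concretely, this is a minimal sketch of the masking I would expect (dummy data; m, k, and all names are illustrative, not taken from the code in question):

```python
import numpy as np

m, k = 4, 3                                # mini-batch size, number of actions
rng = np.random.default_rng(0)
q_pred = rng.normal(size=(m, k))           # Q-network output for s
a = rng.integers(0, k, size=m)             # observed actions a[j]
td_target = rng.normal(size=m)             # r + gamma * max_a' Q(s2, a')

# Only q_pred[j, a[j]] enters the loss; all other outputs are masked out.
loss = np.mean((td_target - q_pred[np.arange(m), a]) ** 2)
print(loss)
```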
Here, the target values (except target[j][a[j]]) are set to zero instead of masking the network's output, which means a target value of zero is used and the Q-network is trained toward zero for all actions other than a[j] when we are at state s[j, :].
Am I missing something?
EDIT:
It is OK (though a bit confusing). What is actually stored in target is the difference (the TD error), not the target value itself, so the zero entries for the other actions contribute zero error instead of pulling those Q-values toward zero.
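For reference, a minimal sketch of that pattern (again dummy data and illustrative names): the target array holds the difference, so the zeros elsewhere mean zero error and zero gradient:

```python
import numpy as np

m, k = 4, 3                                # mini-batch size, number of actions
rng = np.random.default_rng(0)
q_pred = rng.normal(size=(m, k))           # Q-network output for s
a = rng.integers(0, k, size=m)             # observed actions a[j]
td_target = rng.normal(size=m)             # r + gamma * max_a' Q(s2, a')

delta = np.zeros((m, k))                   # holds the TD difference, zero elsewhere
delta[np.arange(m), a] = td_target - q_pred[np.arange(m), a]

loss = 0.5 * np.sum(delta ** 2)            # same loss as explicit masking
grad_wrt_q_pred = -delta                   # zero gradient for all unobserved actions
```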