In the DQN loss, the update should only happen for the observed action. Assume a mini-batch of (s, a, r, s2) with m samples (so s and s2 are m x n, where n is the number of features, and so on). Then, for sample j in the mini-batch, only a[j] should contribute to the loss, which means that all output elements of the Q-network (i.e., target[j][:]) except target[j][a[j]] should be masked.
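Concretely, this is a minimal sketch of the masking I would expect (dummy data; m, k, and all names are illustrative, not taken from the code in question):

```python
import numpy as np

m, k = 4, 3                                # mini-batch size, number of actions
rng = np.random.default_rng(0)
q_pred = rng.normal(size=(m, k))           # Q-network output for s
a = rng.integers(0, k, size=m)             # observed actions a[j]
td_target = rng.normal(size=m)             # r + gamma * max_a' Q(s2, a')

# Only q_pred[j, a[j]] enters the loss; all other outputs are masked out.
loss = np.mean((td_target - q_pred[np.arange(m), a]) ** 2)
print(loss)
```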
Here, the target values (except target[j][a[j]]) are set to zero instead of masking the network's output, which means a target value of zero is used and the Q-network is trained toward zero for all actions other than a[j] when we are at state s[j, :].
Am I missing something?
EDIT:
It is OK (though a bit confusing). What is actually stored in target is the difference (the TD error), not the target value itself, so the zero entries for the other actions contribute zero error instead of pulling those Q-values toward zero.
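For reference, a minimal sketch of that pattern (again dummy data and illustrative names): the target array holds the difference, so the zeros elsewhere mean zero error and zero gradient:

```python
import numpy as np

m, k = 4, 3                                # mini-batch size, number of actions
rng = np.random.default_rng(0)
q_pred = rng.normal(size=(m, k))           # Q-network output for s
a = rng.integers(0, k, size=m)             # observed actions a[j]
td_target = rng.normal(size=m)             # r + gamma * max_a' Q(s2, a')

delta = np.zeros((m, k))                   # holds the TD difference, zero elsewhere
delta[np.arange(m), a] = td_target - q_pred[np.arange(m), a]

loss = 0.5 * np.sum(delta ** 2)            # same loss as explicit masking
grad_wrt_q_pred = -delta                   # zero gradient for all unobserved actions
```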