philtabor / Deep-Q-Learning-Paper-To-Code

MIT License
351 stars 146 forks source link

Regarding target calculations in DuelDDQN and indices #8

Open EXJUSTICE opened 4 years ago

EXJUSTICE commented 4 years ago

Hi Phil,

I implemented your DuelDDQN architecture for myself, and was curious as to the following snippet of the learning function, as my question wasn't detailed in the course.

        q_pred = T.add(V_s,
                        (A_s - A_s.mean(dim=1, keepdim=True)))[indices, actions]

        q_next = T.add(V_s_, (A_s_ - A_s_.mean(dim=1, keepdim=True)))

        q_eval = T.add(V_s_eval, (A_s_eval - A_s_eval.mean(dim=1,keepdim=True)))

Why is it that only Q_pred is muliplied by the action maatrix, is it because it represents the actions we have just taken in the current state? Are all of these q_value matrices of the same dimensions?