Open ParmpalGill opened 4 years ago
I think it's because actions is a one-hot vector with a 1 only at the chosen action, so multiplying gives you a vector that is zero everywhere except the one place that holds the Q-value of that action. The reduce_sum just pulls that number out, because all the other entries are zeros. What do you think?
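To make that concrete, here is a minimal sketch in NumPy rather than TensorFlow (`np.sum` with `axis=1` plays the role of `reduce_sum` here). The arrays `q_values` and `actions` are made-up illustrative data, not taken from the repo's code.

```python
import numpy as np

# Hypothetical batch: 2 states, 3 possible actions each.
q_values = np.array([[1.0, 5.0, 2.0],
                     [0.5, 0.1, 0.9]])

# One-hot encoding of the actions actually taken (action 2, then action 0).
actions = np.array([[0.0, 0.0, 1.0],
                    [1.0, 0.0, 0.0]])

# Multiplying zeroes out every entry except the chosen action,
# so summing over the action axis leaves only that action's Q-value.
q_taken = np.sum(q_values * actions, axis=1)

# argmax returns the index of the largest Q-value (the greedy action),
# which is not necessarily the action that was taken.
greedy_idx = np.argmax(q_values, axis=1)

print(q_taken)     # Q-values of the taken actions
print(greedy_idx)  # indices of the greedy actions
```

In this example the taken actions are not the greedy ones, so argmax would point at a different entry than the one the one-hot mask selects.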
Why multiply by the action and use reduce_sum instead of argmax?