tonegas closed this pull request 6 years ago
I have modified how the value of target[0][action] is obtained, because in the paper https://arxiv.org/abs/1312.5602 this value is computed as
target[0][action] = r_j + gamma * max_{a'}( Qhat( state_{j+1}, a' ) )
and not, as in the previous code, as
target[0][action] = r_j + gamma * Qhat( state_{j+1}, argmax_{a'}( Q( state_{j+1}, a' ) ) )
With this fix the algorithm is more stable.
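For anyone skimming the diff, here is a minimal NumPy sketch of the two targets side by side; the variable names, gamma, and the toy Q-values are assumptions for illustration, not the repository's actual code:

```python
import numpy as np

gamma = 0.99  # discount factor (assumed value)
r_j = 1.0     # reward observed at step j (toy value)

# Toy Q-values for state_{j+1}; shapes and numbers are illustrative only.
q_hat_next = np.array([0.2, 1.5, 0.7])  # Qhat(state_{j+1}, .) from the target network
q_next = np.array([1.0, 0.3, 0.9])      # Q(state_{j+1}, .) from the online network

# Fixed target, as in the paper: bootstrap with the target network's own maximum.
target_fixed = r_j + gamma * np.max(q_hat_next)           # 1.0 + 0.99 * 1.5 = 2.485

# Previous code: evaluate Qhat at the action the online network prefers
# (a Double-DQN-style target).
target_old = r_j + gamma * q_hat_next[np.argmax(q_next)]  # 1.0 + 0.99 * 0.2 = 1.198
```

The fixed value is what would then be written into target[0][action] before fitting the model on that sample.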
Sorry it took forever to merge this PR. Thanks!