Closed laz8 closed 4 years ago
Closed; I was confused by the different versions of a DDQN.
It is explained here:
What makes this network a Double DQN?
The Bellman equation used to calculate the Q-value targets for updating the online network is:
value = reward + discount_factor * target_network.predict(next_state)[argmax(online_network.predict(next_state))]
The Bellman equation used to calculate the Q value updates in the original (vanilla) DQN[1] is:
value = reward + discount_factor * max(target_network.predict(next_state))
The difference is that, in the terminology of the field, the second equation uses the target network for both SELECTING and EVALUATING the action, whereas the first equation uses the online network for SELECTING the action and the target network for EVALUATING it. Selection here means choosing which action to take; evaluation means getting the projected Q value for that action. This form of the Bellman update is what makes this agent a Double DQN rather than a plain DQN, and it was introduced in [2].
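To make the selection/evaluation split concrete, here is a minimal sketch of the two targets side by side. The Q-value arrays are made-up numbers standing in for `online_network.predict(next_state)` and `target_network.predict(next_state)`; they are not from the repo.

```python
import numpy as np

# Hypothetical Q-value predictions for next_state (illustrative numbers):
# index = action, value = estimated Q.
online_q_next = np.array([1.0, 3.0, 2.0])   # stand-in for online_network.predict(next_state)
target_q_next = np.array([1.5, 0.5, 2.5])   # stand-in for target_network.predict(next_state)

reward = 1.0
discount_factor = 0.99

# Vanilla DQN: the target network both SELECTS and EVALUATES the action.
dqn_target = reward + discount_factor * np.max(target_q_next)

# Double DQN: the online network SELECTS the action (argmax),
# and the target network EVALUATES that chosen action.
best_action = int(np.argmax(online_q_next))
ddqn_target = reward + discount_factor * target_q_next[best_action]

print(dqn_target)   # 1.0 + 0.99 * 2.5 = 3.475
print(ddqn_target)  # 1.0 + 0.99 * 0.5 = 1.495
```

Note how the two targets can differ substantially: the vanilla DQN max tends to pick overestimated Q values, which is exactly the bias Double DQN was designed to reduce.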
The names also confused me: everything is called a target, and a lot of things were renamed, which makes the code harder to follow.
But it seems to be correct.
You are doing Q-Learning:
https://github.com/rlcode/reinforcement-learning/blob/2fe6984da684c3f64a8d09d1718dbac9330aecea/2-cartpole/2-double-dqn/cartpole_ddqn.py#L111
But isn't that SARSA?
Is that a mistake or is that a valid approach? I'm new to RL...
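For reference, the textbook distinction between the two updates can be sketched in tabular form (variable names are hypothetical, not from the repo): Q-learning is off-policy and bootstraps from the max over next-state actions, while SARSA is on-policy and bootstraps from the action the behaviour policy actually takes next.

```python
import numpy as np

n_states, n_actions = 4, 2
alpha, gamma = 0.5, 0.9          # learning rate and discount factor

Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next):
    # Q-learning (off-policy): target uses max over next-state actions,
    # regardless of which action is actually taken next.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(s, a, r, s_next, a_next):
    # SARSA (on-policy): target uses a_next, the action the behaviour
    # policy actually chose in the next state.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```

The linked line bootstraps from a max over next-state Q values, which matches the Q-learning form, not the SARSA form.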