rlcode / reinforcement-learning

Minimal and Clean Reinforcement Learning Examples
MIT License

Why are you using SARSA instead of Q-Learning? #94

Closed · laz8 closed this issue 4 years ago

laz8 commented 4 years ago

You are doing Q-Learning:

            # get action for the current state and go one step in environment
            action = agent.get_action(state)
            next_state, reward, done, info = env.step(action)

https://github.com/rlcode/reinforcement-learning/blob/2fe6984da684c3f64a8d09d1718dbac9330aecea/2-cartpole/2-double-dqn/cartpole_ddqn.py#L111

But isn't that SARSA?

                a = np.argmax(target_next[i])
                target[i][action[i]] = reward[i] + self.discount_factor * (target_val[i][a])

Is that a mistake or is that a valid approach? I'm new to RL...
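
To show where my confusion comes from, this is roughly how I understand the two update targets (just a sketch with made-up numbers, not code from this repo):

```python
import numpy as np

# Made-up Q-value predictions for next_state over 3 actions; not repo data.
q_next = np.array([0.2, 0.9, 0.1])
reward, discount_factor = 1.0, 0.99
next_action = 0  # the action the agent actually takes in next_state

# SARSA (on-policy): bootstrap with the action actually taken next.
sarsa_target = reward + discount_factor * q_next[next_action]

# Q-learning (off-policy): bootstrap with the greedy action, regardless of what is taken.
q_learning_target = reward + discount_factor * np.max(q_next)

print(sarsa_target, q_learning_target)
```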

laz8 commented 4 years ago

Closed, I was confused by the different versions of DDQN.

It is explained here:

What makes this network a Double DQN?

The Bellman equation used to calculate the Q values to update the online network follows the equation:

value = reward + discount_factor * target_network.predict(next_state)[argmax(online_network.predict(next_state))]

The Bellman equation used to calculate the Q value updates in the original (vanilla) DQN[1] is:

value = reward + discount_factor * max(target_network.predict(next_state))

The difference is that, using the terminology of the field, the second equation uses the target network for both SELECTING and EVALUATING the action to take, whereas the first equation uses the online network for SELECTING the action and the target network for EVALUATING it. Selection here means choosing which action to take, and evaluation means getting the projected Q value for that action. This form of the Bellman equation is what makes this agent a Double DQN rather than just a DQN, and it was introduced in [2].

https://medium.com/@leosimmons/double-dqn-implementation-to-solve-openai-gyms-cartpole-v-0-df554cd0614d
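
To make the difference concrete, here is a minimal numpy sketch of the two targets, with variable names loosely following the snippet above (I'm assuming target_next are the online network's predictions for next_state and target_val are the target network's):

```python
import numpy as np

# Made-up Q-value predictions for a single next_state; not actual repo tensors.
target_next = np.array([0.2, 0.9, 0.1])  # online network:  Q_online(next_state, .)
target_val  = np.array([0.3, 0.7, 0.4])  # target network:  Q_target(next_state, .)
reward, discount_factor = 1.0, 0.99

# Vanilla DQN: the target network both SELECTS and EVALUATES the next action.
dqn_target = reward + discount_factor * np.max(target_val)

# Double DQN: the online network SELECTS the action, the target network EVALUATES it.
a = np.argmax(target_next)
ddqn_target = reward + discount_factor * target_val[a]

print(dqn_target, ddqn_target)
```

Neither of these is SARSA: SARSA would bootstrap with the action the agent actually takes in next_state under its (epsilon-greedy) behaviour policy, not with an argmax.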

The names also confused me: everything is called a target, and you renamed a lot of stuff, which makes your code harder to understand.

But it seems to be correct.