rlcode / reinforcement-learning

Minimal and Clean Reinforcement Learning Examples
MIT License

Why are you using SARSA instead of Q-Learning? #94

Closed · laz8 closed this issue 4 years ago

laz8 commented 4 years ago

You are doing Q-Learning:

            # get action for the current state and go one step in environment
            action = agent.get_action(state)
            next_state, reward, done, info = env.step(action)

https://github.com/rlcode/reinforcement-learning/blob/2fe6984da684c3f64a8d09d1718dbac9330aecea/2-cartpole/2-double-dqn/cartpole_ddqn.py#L111

But isn't that SARSA?

                a = np.argmax(target_next[i])
                target[i][action[i]] = reward[i] + self.discount_factor * (target_val[i][a])

Is that a mistake or is that a valid approach? I'm new to RL...
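
To show where my confusion comes from, this is roughly how I understand the two update targets (just a sketch with made-up numbers, not code from this repo):

```python
import numpy as np

# Made-up Q-value predictions for next_state over 3 actions; not repo data.
q_next = np.array([0.2, 0.9, 0.1])
reward, discount_factor = 1.0, 0.99
next_action = 0  # the action the agent actually takes in next_state

# SARSA (on-policy): bootstrap with the action actually taken next.
sarsa_target = reward + discount_factor * q_next[next_action]

# Q-learning (off-policy): bootstrap with the greedy action, regardless of what is taken.
q_learning_target = reward + discount_factor * np.max(q_next)

print(sarsa_target, q_learning_target)
```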

laz8 commented 4 years ago

Closed, I was confused by the different versions of DDQN.

It is explained here:

What makes this network a Double DQN?

The Bellman equation used to calculate the Q values to update the online network follows the equation:

value = reward + discount_factor * target_network.predict(next_state)[argmax(online_network.predict(next_state))]

The Bellman equation used to calculate the Q value updates in the original (vanilla) DQN[1] is:

value = reward + discount_factor * max(target_network.predict(next_state))

The difference is that, using the terminology of the field, the second equation uses the target network for both SELECTING and EVALUATING the action to take, whereas the first equation uses the online network for SELECTING the action and the target network for EVALUATING it. Selection here means choosing which action to take, and evaluation means getting the projected Q value for that action. This form of the Bellman equation is what makes this agent a Double DQN rather than just a DQN, and it was introduced in [2].

https://medium.com/@leosimmons/double-dqn-implementation-to-solve-openai-gyms-cartpole-v-0-df554cd0614d
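
To make the difference concrete, here is a minimal numpy sketch of the two targets, with variable names loosely following the snippet above (I'm assuming target_next are the online network's predictions for next_state and target_val are the target network's):

```python
import numpy as np

# Made-up Q-value predictions for a single next_state; not actual repo tensors.
target_next = np.array([0.2, 0.9, 0.1])  # online network:  Q_online(next_state, .)
target_val  = np.array([0.3, 0.7, 0.4])  # target network:  Q_target(next_state, .)
reward, discount_factor = 1.0, 0.99

# Vanilla DQN: the target network both SELECTS and EVALUATES the next action.
dqn_target = reward + discount_factor * np.max(target_val)

# Double DQN: the online network SELECTS the action, the target network EVALUATES it.
a = np.argmax(target_next)
ddqn_target = reward + discount_factor * target_val[a]

print(dqn_target, ddqn_target)
```

Neither of these is SARSA: SARSA would bootstrap with the action the agent actually takes in next_state under its (epsilon-greedy) behaviour policy, not with an argmax.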

The names also confused me: everything is called a target, and you renamed a lot of stuff, which makes your code harder to understand.

But it seems to be correct.