Maryisme opened 1 month ago
I am not involved in the book, but I will try to answer your questions:
Methods:
Formulas:
→ SARSA uses the actual action taken in the next state ($A$), while Q-Learning considers the best possible action ($a$) in the next state. So $A$ denotes the specific action chosen by the policy, whereas $a$ ranges over all possible actions. This reflects Q-Learning's off-policy nature: its update is based on the optimal future action rather than the one actually taken. As far as I know, the capital/lowercase distinction is just a notational choice of this book; you will often see the next action written as $a$ in both SARSA and Q-Learning. (To add to the confusion, in other conventions $a$ means a single action and $A$ the set of possible actions in an environment.)
https://tcnguyen.github.io/reinforcement_learning/sarsa_vs_q_learning.html
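For reference, here are the two update rules in the standard form (I am reconstructing them here, not quoting the book). The only difference is the TD target:

SARSA (on-policy), using the next action $A_{t+1}$ that was actually taken:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$

Q-Learning (off-policy), maximizing over all possible next actions $a$:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$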
> I would like to get a slightly better understanding of the difference between on-policy and off-policy methods, as well as some clarification on the formulas used to apply them. In particular, I am interested in the difference between "A" and "a" in these formulas.
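To make the on-policy/off-policy contrast concrete, here is a minimal sketch (function and variable names are mine, not from the book or the linked page) showing that the two algorithms differ in a single line, namely how the TD target selects the next-state action value:

```python
ALPHA, GAMMA = 0.5, 0.9  # illustrative learning rate and discount factor

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: the target uses the action a_next that the behavior
    # policy actually took in s_next (this is the "A" in the book).
    target = r + GAMMA * Q[s_next][a_next]
    Q[s][a] += ALPHA * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: the target maximizes over all possible next actions
    # (the "a" in the book), regardless of what will actually be taken.
    target = r + GAMMA * max(Q[s_next].values())
    Q[s][a] += ALPHA * (target - Q[s][a])

# Toy table: two states, two actions.
Q = {0: {"L": 0.0, "R": 0.0}, 1: {"L": 1.0, "R": 2.0}}
sarsa_update(Q, 0, "L", 0.0, 1, "L")   # target uses Q[1]["L"] = 1.0
q_learning_update(Q, 0, "R", 0.0, 1)   # target uses max(Q[1]) = 2.0
```

Note that if the behavior policy happens to pick the greedy action, the two updates coincide; they diverge exactly when exploration takes a non-greedy action.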