Maryisme opened 1 month ago
I am not involved in the book, but I will try to answer your questions:
Methods:
Formulas:
→ SARSA uses the actual action taken in the next state ($A$), while Q-Learning considers the best possible action ($a$) in the next state. So $A$ denotes the specific action chosen by the policy, whereas $a$ ranges over all possible actions. This reflects Q-Learning's off-policy nature: its update is based on the optimal future action rather than the one actually taken. As far as I know, the capital/lowercase distinction is just a notational choice of this book; you will often see the next action written as $a$ in both SARSA and Q-Learning. (To add to the confusion, in other conventions $a$ means a single action and $A$ the set of possible actions in an environment.)
https://tcnguyen.github.io/reinforcement_learning/sarsa_vs_q_learning.html
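For reference, here are the two update rules in the standard form (I am reconstructing them here, not quoting the book). The only difference is the TD target:

SARSA (on-policy), using the next action $A_{t+1}$ that was actually taken:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$

Q-Learning (off-policy), maximizing over all possible next actions $a$:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$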
> I would like to get a slightly better understanding of the difference between on-policy and off-policy methods, as well as some clarification on the formulas used to apply them. In particular, I am interested in the difference between "A" and "a" in these formulas.
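To make the on-policy/off-policy contrast concrete, here is a minimal sketch (function and variable names are mine, not from the book or the linked page) showing that the two algorithms differ in a single line, namely how the TD target selects the next-state action value:

```python
ALPHA, GAMMA = 0.5, 0.9  # illustrative learning rate and discount factor

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: the target uses the action a_next that the behavior
    # policy actually took in s_next (this is the "A" in the book).
    target = r + GAMMA * Q[s_next][a_next]
    Q[s][a] += ALPHA * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: the target maximizes over all possible next actions
    # (the "a" in the book), regardless of what will actually be taken.
    target = r + GAMMA * max(Q[s_next].values())
    Q[s][a] += ALPHA * (target - Q[s][a])

# Toy table: two states, two actions.
Q = {0: {"L": 0.0, "R": 0.0}, 1: {"L": 1.0, "R": 2.0}}
sarsa_update(Q, 0, "L", 0.0, 1, "L")   # target uses Q[1]["L"] = 1.0
q_learning_update(Q, 0, "R", 0.0, 1)   # target uses max(Q[1]) = 2.0
```

Note that if the behavior policy happens to pick the greedy action, the two updates coincide; they diverge exactly when exploration takes a non-greedy action.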