rasbt / machine-learning-book

Code Repository for Machine Learning with PyTorch and Scikit-Learn
https://sebastianraschka.com/books/#machine-learning-with-pytorch-and-scikit-learn
MIT License
3.32k stars · 1.21k forks

Chapter 19, Reinforcement Learning, p. 691 #189

Open Maryisme opened 1 month ago

Maryisme commented 1 month ago

I would like to get a slightly better understanding of the difference between on-policy and off-policy learning, along with some clarification of the formulas used to apply them. In particular, I am interested in the difference between "$A$" and "$a$" as used in these formulas.

Maryisme commented 1 month ago
(Two screenshots attached: Screenshot 2024-07-31 at 16 54 39, Screenshot 2024-07-31 at 16 54 45)
d-kleine commented 1 month ago

I am not involved in the book, but I will try to answer your questions:

→ SARSA uses the action actually taken ($A$), while Q-learning considers the best possible action ($a$) in the next state. In these formulas, $A$ denotes the specific action chosen by the policy, whereas $a$ ranges over all possible actions; the $\max_a$ in the Q-learning target reflects its off-policy nature, where the update is based on the optimal future action rather than the one actually taken. Keep in mind, though, that this is just a notational convention: in other sources you will often see lowercase $a$ used in both the SARSA and the Q-learning update rules. (Strictly speaking, $a$ usually denotes an individual action and $A$ the set of possible actions in an environment, so the notation here can be quite confusing.)
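To make the difference concrete, here is a minimal tabular sketch (not from the book; the toy state/action sizes and hyperparameters are my own assumptions). The only difference between the two update functions is the TD target: SARSA bootstraps from the action the behavior policy actually picks next, while Q-learning bootstraps from the greedy maximum over all actions.

```python
import numpy as np

# Hypothetical toy setup: 5 states, 2 actions, tabular Q-values.
rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1


def epsilon_greedy(Q, state):
    # Behavior policy used by both algorithms to select actions.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))


def sarsa_update(Q, s, a, r, s_next):
    # On-policy: the TD target uses the action A' actually chosen
    # by the epsilon-greedy behavior policy in the next state.
    a_next = epsilon_greedy(Q, s_next)
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
    return a_next  # SARSA then executes this same action


def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: the TD target maximizes over all actions a in the
    # next state, regardless of which action will actually be taken.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```

So the exploratory behavior feeds directly into SARSA's estimates (it learns the value of the policy it follows), whereas Q-learning always updates toward the greedy target and thereby learns the optimal value function even while behaving exploratorily.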

https://tcnguyen.github.io/reinforcement_learning/sarsa_vs_q_learning.html