Hi @MasterScrat!
Actor-critic is a generic term, more like an adjective, that describes a wide range of RL algorithms. An actor-critic algorithm has both an actor (a policy with some learnable parameters, where at least some of those parameters belong only to the actor) and a critic (a value function approximator with learnable parameters, where at least some of those parameters belong only to the critic).
The critic in an actor-critic algorithm can be any kind of value function: an on-policy value function V^pi(s), an optimal value function V^*(s), an on-policy action-value function Q^pi(s,a), or an optimal action-value function Q^*(s,a). It doesn't just refer to Q-functions.
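To make the actor/critic split concrete, here is a minimal sketch in PyTorch (not taken from the Spinning Up codebase; the class names, layer sizes, and discrete-action setup are just illustrative assumptions). It shows an actor with its own parameters, plus two possible critics: one approximating V(s) and one approximating Q(s,a).

```python
import torch
import torch.nn as nn

# Actor: a policy pi_theta(a|s) with its own learnable parameters.
class Actor(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs):
        # Returns a categorical distribution over discrete actions.
        return torch.distributions.Categorical(logits=self.net(obs))

# One possible critic: a state-value function V(s),
# the kind of critic used by A2C/A3C and other policy optimization methods.
class StateValueCritic(nn.Module):
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

# Another possible critic: an action-value function Q(s, a),
# the kind of critic used by DDPG/TD3/SAC-style methods.
class QCritic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)
```

In this sketch the actor and critic are entirely separate parameter sets; in practice some layers may be shared, but at least some parameters belong only to each.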
A2C/A3C learn an on-policy value function that takes only the state as an argument, V^pi(s), not a Q-function. As such, they should not be connected to the Q-learning branch.
Critics that approximate value functions (as opposed to action-value functions, a.k.a. Q-functions) show up in every modern policy optimization algorithm (e.g. VPG, TRPO, PPO, A2C/A3C, and others). As a result, while "actor-critic" was historically a useful term for distinguishing between different kinds of policy optimization algorithms (those that used Monte Carlo return estimates were actor-only methods, while those that used critic-based advantage estimation were actor-critic methods), it is no longer a specific enough characteristic to carry useful information.
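A rough sketch of that historical distinction, assuming a discount factor gamma and a critic that estimates V(s) (function names and the one-step advantage form are my own illustration, not Spinning Up code): an actor-only method scores actions with Monte Carlo returns, while an actor-critic method bootstraps advantage estimates off the critic.

```python
import torch

def monte_carlo_returns(rewards, gamma=0.99):
    """Actor-only style: discounted reward-to-go estimates, no critic involved."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return torch.tensor(returns)

def one_step_advantages(rewards, values, next_values, dones, gamma=0.99):
    """Critic-based style (A2C-like): bootstrap off the critic's V(s) estimates.

    A(s_t, a_t) ~= r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
    """
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)
    return rewards + gamma * next_values * (1.0 - dones) - values
```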
Hopefully this is helpful!
On this page: https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html
More specifically, in this diagram: https://spinningup.openai.com/en/latest/_images/rl_algorithms_9_15.svg
I am surprised that the "A2C/A3C" box doesn't make use of Q-learning. They are, by their names, actor-critic methods, and the critic uses Q-learning to learn the value of the states. As such, shouldn't they be connected to the "Q-learning" branch?