openai / spinningup

An educational resource to help anyone learn deep reinforcement learning.
https://spinningup.openai.com/
MIT License

A2C/A3C: don't they use Q-learning? #156

Closed MasterScrat closed 5 years ago

MasterScrat commented 5 years ago

On this page: https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

More specifically in this diagram: https://spinningup.openai.com/en/latest/_images/rl_algorithms_9_15.svg

I am surprised that the "A2C/A3C" box doesn't make use of Q-learning. They are, by their names, Actor-Critic methods, and the critic uses Q-learning to learn the values of states. As such, shouldn't they be connected to the "Q-learning" branch?

jachiam commented 5 years ago

Hi @MasterScrat!

Actor-critic is a generic term, like an adjective, that describes a wide range of RL algorithms. An actor-critic algorithm has both an actor (a learned policy) and a critic (a learned value function).

The critic in an actor-critic algorithm can be any kind of value function: an on-policy value function V^pi(s), an optimal value function V^*(s), an on-policy action-value function Q^pi(s,a), or an optimal action-value function Q^*(s,a). The term doesn't refer only to Q-functions.
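For reference, the standard definitions (roughly as given in the Spinning Up intro docs, with R(tau) the return of a trajectory) are:

```latex
\begin{align*}
V^{\pi}(s)   &= \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right] \\
Q^{\pi}(s,a) &= \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s, a_0 = a \right] \\
V^{*}(s)     &= \max_{\pi} V^{\pi}(s) \\
Q^{*}(s,a)   &= \max_{\pi} Q^{\pi}(s,a)
\end{align*}
```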

A2C/A3C learn an on-policy value function that takes only the state as an argument, V^pi(s), not a Q-function. As such, they should not be connected to the Q-learning branch.
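To make the distinction concrete, here is a minimal sketch (not the Spinning Up implementations; it assumes PyTorch and hypothetical `v_net` / `q_net` modules) of the two different critic targets:

```python
# Sketch only: contrasts a state-value critic target with a Q-learning target.
import torch

gamma = 0.99  # discount factor

def a2c_critic_loss(v_net, s, r, s_next, done):
    """Critic for a state-value function V^pi(s): bootstrapped return as target."""
    v = v_net(s).squeeze(-1)  # V(s), shape [batch]
    with torch.no_grad():
        target = r + gamma * (1 - done) * v_net(s_next).squeeze(-1)
    return ((v - target) ** 2).mean()

def q_learning_loss(q_net, s, a, r, s_next, done):
    """Q-learning critic for Q^*(s,a): target bootstraps through a max over actions."""
    q = q_net(s).gather(1, a.unsqueeze(-1)).squeeze(-1)  # Q(s,a), shape [batch]
    with torch.no_grad():
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return ((q - target) ** 2).mean()
```

The A2C/A3C critic never takes an action as input and never maximizes over actions; it just regresses V^pi(s), which is then used to form advantage estimates for the policy gradient.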

Critics that approximate state-value functions (as opposed to action-value functions, a.k.a. Q-functions) show up in every modern policy optimization algorithm (e.g. VPG, TRPO, PPO, A2C/A3C, and others). As a result, while "actor-critic" was historically a useful term for distinguishing between kinds of policy optimization algorithms (e.g. those that used Monte Carlo return estimates, which were actor-only methods, versus those that used critic-based advantage estimation, which were actor-critic methods), it is no longer a specific enough characteristic to carry useful information.
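A minimal sketch of that historical distinction (again assuming PyTorch; the names `v_net`, `states`, `rewards` are hypothetical):

```python
# Sketch only: Monte Carlo returns (actor-only) vs. critic-based advantages (actor-critic).
import torch

gamma = 0.99

def monte_carlo_returns(rewards):
    """Actor-only style: estimate returns-to-go directly from sampled rewards."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return torch.tensor(list(reversed(returns)))

def critic_advantages(v_net, states, rewards):
    """Actor-critic style: subtract a learned V^pi(s) baseline from the returns."""
    returns = monte_carlo_returns(rewards)
    with torch.no_grad():
        values = v_net(states).squeeze(-1)
    return returns - values  # advantage estimates that weight the policy gradient
```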

Hopefully this is helpful!