openai / spinningup

An educational resource to help anyone learn deep reinforcement learning.
https://spinningup.openai.com/
MIT License

A2C/A3C: don't they use Q-learning? #156

Closed MasterScrat closed 5 years ago

MasterScrat commented 5 years ago

On this page: https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

More specifically in this diagram: https://spinningup.openai.com/en/latest/_images/rl_algorithms_9_15.svg

I am surprised that the "A2C/A3C" box doesn't make use of Q-learning. They are, by their names, Actor-Critic methods, and the critic uses Q-learning to learn the values of states. As such, shouldn't they be connected to the "Q-learning" branch?

jachiam commented 5 years ago

Hi @MasterScrat!

Actor-critic is a generic term, like an adjective, that describes a wide range of RL algorithms. An actor-critic algorithm has both an actor (a learned policy) and a critic (a learned value function).

The critic in an actor-critic algorithm can be any kind of value function: an on-policy value function V^pi(s), an optimal value function V^*(s), an on-policy action-value function Q^pi(s,a), or an optimal action-value function Q^*(s,a). The term doesn't refer only to Q-functions.
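For reference, the standard definitions (roughly as given in the Spinning Up intro docs, with R(tau) the return of a trajectory) are:

```latex
\begin{align*}
V^{\pi}(s)   &= \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right] \\
Q^{\pi}(s,a) &= \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s, a_0 = a \right] \\
V^{*}(s)     &= \max_{\pi} V^{\pi}(s) \\
Q^{*}(s,a)   &= \max_{\pi} Q^{\pi}(s,a)
\end{align*}
```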

A2C/A3C learn an on-policy value function that takes only the state as an argument, V^pi(s), not a Q-function. As such, they should not be connected to the Q-learning branch.
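To make the distinction concrete, here is a minimal sketch (not the Spinning Up implementations; it assumes PyTorch and hypothetical `v_net` / `q_net` modules) of the two different critic targets:

```python
# Sketch only: contrasts a state-value critic target with a Q-learning target.
import torch

gamma = 0.99  # discount factor

def a2c_critic_loss(v_net, s, r, s_next, done):
    """Critic for a state-value function V^pi(s): bootstrapped return as target."""
    v = v_net(s).squeeze(-1)  # V(s), shape [batch]
    with torch.no_grad():
        target = r + gamma * (1 - done) * v_net(s_next).squeeze(-1)
    return ((v - target) ** 2).mean()

def q_learning_loss(q_net, s, a, r, s_next, done):
    """Q-learning critic for Q^*(s,a): target bootstraps through a max over actions."""
    q = q_net(s).gather(1, a.unsqueeze(-1)).squeeze(-1)  # Q(s,a), shape [batch]
    with torch.no_grad():
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return ((q - target) ** 2).mean()
```

The A2C/A3C critic never takes an action as input and never maximizes over actions; it just regresses V^pi(s), which is then used to form advantage estimates for the policy gradient.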

Critics that approximate state-value functions (as opposed to action-value functions, a.k.a. Q-functions) show up in every modern policy optimization algorithm (e.g. VPG, TRPO, PPO, A2C/A3C, and others). As a result, while "actor-critic" was historically a useful term for distinguishing between kinds of policy optimization algorithms (e.g. those that used Monte Carlo return estimates, which were actor-only methods, versus those that used critic-based advantage estimation, which were actor-critic methods), it is no longer a specific enough characteristic to carry useful information.
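A minimal sketch of that historical distinction (again assuming PyTorch; the names `v_net`, `states`, `rewards` are hypothetical):

```python
# Sketch only: Monte Carlo returns (actor-only) vs. critic-based advantages (actor-critic).
import torch

gamma = 0.99

def monte_carlo_returns(rewards):
    """Actor-only style: estimate returns-to-go directly from sampled rewards."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return torch.tensor(list(reversed(returns)))

def critic_advantages(v_net, states, rewards):
    """Actor-critic style: subtract a learned V^pi(s) baseline from the returns."""
    returns = monte_carlo_returns(rewards)
    with torch.no_grad():
        values = v_net(states).squeeze(-1)
    return returns - values  # advantage estimates that weight the policy gradient
```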

Hopefully this is helpful!