
Playing Atari With Deep Reinforcement Learning #3


Playing Atari With Deep Reinforcement Learning

http://arxiv.org/abs/1312.5602

One of the first papers from DeepMind.
Introduced deep Q-networks (DQN), which became the state of the art in reinforcement learning.

Goal: play a game with only the pixels and rewards as input

Background: Reinforcement Learning

At each time step t:

We observe the sequence of observations (game screens) and actions so far: s_t = x_1, a_1, x_2, a_2, ..., a_{t-1}, x_t.
We learn game strategies on these sequences: given a sequence, we try to choose the next action to take.

Q-Learning

For each pair (sequence, action) (s, a), we try to determine the Q-value of (s, a).
The Q-value is defined as the maximum expected discounted return achievable from (s, a) over all possible policies.
The Bellman equation states that:

Q*(s, a) = E_{s'} [ r + γ max_{a'} Q*(s', a') | s, a ]

If we know the optimal values Q*(s', a') for the next sequence s', we can deduce the optimal value for the current one.
One easy algorithm would be to iteratively set Q(s, a) = r + γ max_{a'} Q(s', a'), but this tabular approach ignores any structure shared between pairs (s, a): for each new pair (s, a), we have to learn its Q-value from scratch.
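For reference, here is a minimal sketch of this tabular update; the environment interface and the names (gamma, actions) are illustrative, not from the paper.

```python
from collections import defaultdict

# Tabular Bellman backup: Q(s, a) <- r + gamma * max_a' Q(s', a').
# Every (state, action) pair has its own entry, so nothing generalizes
# across similar states.

gamma = 0.99
actions = [0, 1, 2, 3]        # e.g. the valid joystick actions
Q = defaultdict(float)        # Q[(state, action)] -> estimated return

def tabular_update(state, action, reward, next_state, done):
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] = reward + gamma * best_next
```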

Q-Network

Instead of learning a separate Q-value for every pair (s, a), a Q-network with weights θ is trained to approximate the Q-function: Q(s, a; θ) ≈ Q*(s, a).

Usually, the Q-network can be a linear model or a neural network. It is trained by minimizing a sequence of losses, one at each iteration i:

L_i(θ_i) = E_{s, a ~ ρ(·)} [ (y_i − Q(s, a; θ_i))^2 ]

where y_i is the target for iteration i, computed with the parameters θ_{i-1} of the Q-network at the previous iteration:

y_i = E_{s'} [ r + γ max_{a'} Q(s', a'; θ_{i-1}) | s, a ]

ρ is the behavior distribution that chooses the next action at each state s. Often, ρ is an ε-greedy strategy that follows the greedy action (highest predicted Q-value) with probability 1-ε and takes a random action with probability ε.
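A small sketch of such an ε-greedy policy; `q_values(state)` is a hypothetical helper returning the network's predictions for every action.

```python
import random

# Epsilon-greedy behavior distribution rho.
# The paper anneals epsilon linearly from 1.0 to 0.1 over the first million
# frames, then keeps it fixed at 0.1.

def epsilon_greedy(state, q_values, num_actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(num_actions)              # explore
    qs = q_values(state)
    return max(range(num_actions), key=lambda a: qs[a])   # exploit: greedy action
```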

Details of the architecture

The deep Q-network is composed of:

- an input of 4 stacked, preprocessed 84x84 grayscale frames
- a convolutional layer with 16 filters of size 8x8 and stride 4, followed by a ReLU
- a convolutional layer with 32 filters of size 4x4 and stride 2, followed by a ReLU
- a fully connected layer with 256 hidden units, followed by a ReLU
- a final linear layer with one output per valid action (the predicted Q-values)
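A minimal PyTorch sketch of this architecture (not the authors' original code; the layer sizes follow the list above):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps 4 stacked 84x84 grayscale frames to one Q-value per action."""

    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # -> 32x9x9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # Q-values, one per action
        )

    def forward(self, x):
        return self.net(x)
```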

Replay memory

Each transition (s, a, r, s') is stored in the replay memory. The maximum capacity of the memory is set to N: when the memory is full, the oldest transitions are discarded.
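A minimal sketch of such a memory using a bounded deque; the class name is illustrative (the paper keeps roughly the one million most recent frames).

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer: appending beyond capacity N drops the oldest transition."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling over the stored transitions
        return random.sample(self.buffer, batch_size)
```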

Batch updates

At each iteration, the algorithm samples a minibatch of transitions from the replay memory and performs a gradient descent update on the network.
Sampling from the memory reduces the variance of the updates and breaks the correlation between consecutive transitions, giving better overall convergence.
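A sketch of one such update in PyTorch, reusing the DQN and ReplayMemory sketches above; the glue names are illustrative (the paper uses RMSProp with minibatches of size 32).

```python
import torch
import torch.nn.functional as F

def train_step(q_net, memory, optimizer, batch_size=32, gamma=0.99):
    states, actions, rewards, next_states, dones = zip(*memory.sample(batch_size))
    states = torch.stack(states)                       # (B, 4, 84, 84)
    next_states = torch.stack(next_states)
    actions = torch.tensor(actions)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    # y = r + gamma * max_a' Q(s', a'); no gradient flows through the target
    with torch.no_grad():
        targets = rewards + gamma * (1.0 - dones) * q_net(next_states).max(dim=1).values

    # Q(s, a) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    loss = F.mse_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that this paper computes the targets with the same network being trained; the separate, periodically updated target network only appears in the later Nature version of DQN.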