
Playing Atari With Deep Reinforcement Learning #3


Playing Atari With Deep Reinforcement Learning

http://arxiv.org/abs/1312.5602

One of the first papers from DeepMind.
Introduced deep Q-networks (DQN), which became the state of the art in reinforcement learning.

Goal: play a game with only the pixels and rewards as input

Background: Reinforcement Learning

At each time step t:

We observe the sequence of observations (game screens) and actions so far: s_t = x_1, a_1, x_2, a_2, ..., a_{t-1}, x_t.
We learn game strategies on these sequences: given a sequence, we try to choose the next action to take.

Q-Learning

For each pair (sequence, action) (s, a), we try to determine the Q-value of (s, a).
The Q-value is defined as the maximum expected discounted return achievable from (s, a) over all possible policies.
The Bellman equation states that:

Q*(s, a) = E_{s'} [ r + γ max_{a'} Q*(s', a') | s, a ]

If we know the optimal values Q*(s', a') for the next sequence s', we can deduce the optimal value for the current one.
One easy algorithm would be to iteratively set Q(s, a) = r + γ max_{a'} Q(s', a'), but this tabular approach ignores any structure shared between pairs (s, a): for each new pair (s, a), we have to learn its Q-value from scratch.
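For reference, here is a minimal sketch of this tabular update; the environment interface and the names (gamma, actions) are illustrative, not from the paper.

```python
from collections import defaultdict

# Tabular Bellman backup: Q(s, a) <- r + gamma * max_a' Q(s', a').
# Every (state, action) pair has its own entry, so nothing generalizes
# across similar states.

gamma = 0.99
actions = [0, 1, 2, 3]        # e.g. the valid joystick actions
Q = defaultdict(float)        # Q[(state, action)] -> estimated return

def tabular_update(state, action, reward, next_state, done):
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] = reward + gamma * best_next
```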

Q-Network

Instead of learning a separate Q-value for every pair (s, a), a Q-network with weights θ is trained to approximate the Q-function: Q(s, a; θ) ≈ Q*(s, a).

Usually, the Q-network can be a linear model or a neural network. It is trained by minimizing a sequence of losses, one at each iteration i:

L_i(θ_i) = E_{s, a ~ ρ(·)} [ (y_i − Q(s, a; θ_i))^2 ]

where y_i is the target for iteration i, computed with the parameters θ_{i-1} of the Q-network at the previous iteration:

y_i = E_{s'} [ r + γ max_{a'} Q(s', a'; θ_{i-1}) | s, a ]

ρ is the behavior distribution that chooses the next action at each state s. Often, ρ is an ε-greedy strategy that follows the greedy action (highest predicted Q-value) with probability 1-ε and takes a random action with probability ε.
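A small sketch of such an ε-greedy policy; `q_values(state)` is a hypothetical helper returning the network's predictions for every action.

```python
import random

# Epsilon-greedy behavior distribution rho.
# The paper anneals epsilon linearly from 1.0 to 0.1 over the first million
# frames, then keeps it fixed at 0.1.

def epsilon_greedy(state, q_values, num_actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(num_actions)              # explore
    qs = q_values(state)
    return max(range(num_actions), key=lambda a: qs[a])   # exploit: greedy action
```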

Details of the architecture

The deep Q-network is composed of:

- an input of 4 stacked, preprocessed 84x84 grayscale frames
- a convolutional layer with 16 filters of size 8x8 and stride 4, followed by a ReLU
- a convolutional layer with 32 filters of size 4x4 and stride 2, followed by a ReLU
- a fully connected layer with 256 hidden units, followed by a ReLU
- a final linear layer with one output per valid action (the predicted Q-values)
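A minimal PyTorch sketch of this architecture (not the authors' original code; the layer sizes follow the list above):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps 4 stacked 84x84 grayscale frames to one Q-value per action."""

    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # -> 32x9x9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # Q-values, one per action
        )

    def forward(self, x):
        return self.net(x)
```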

Replay memory

Each transition (s, a, r, s') is stored in the replay memory. The maximum capacity of the memory is set to N: when the memory is full, the oldest transitions are discarded.
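A minimal sketch of such a memory using a bounded deque; the class name is illustrative (the paper keeps roughly the one million most recent frames).

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer: appending beyond capacity N drops the oldest transition."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling over the stored transitions
        return random.sample(self.buffer, batch_size)
```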

Batch updates

At each iteration, the algorithm samples a minibatch of transitions from the replay memory and performs a gradient descent update on the network.
Sampling from the memory reduces the variance of the updates and breaks the correlation between consecutive transitions, giving better overall convergence.
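A sketch of one such update in PyTorch, reusing the DQN and ReplayMemory sketches above; the glue names are illustrative (the paper uses RMSProp with minibatches of size 32).

```python
import torch
import torch.nn.functional as F

def train_step(q_net, memory, optimizer, batch_size=32, gamma=0.99):
    states, actions, rewards, next_states, dones = zip(*memory.sample(batch_size))
    states = torch.stack(states)                       # (B, 4, 84, 84)
    next_states = torch.stack(next_states)
    actions = torch.tensor(actions)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    # y = r + gamma * max_a' Q(s', a'); no gradient flows through the target
    with torch.no_grad():
        targets = rewards + gamma * (1.0 - dones) * q_net(next_states).max(dim=1).values

    # Q(s, a) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    loss = F.mse_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that this paper computes the targets with the same network being trained; the separate, periodically updated target network only appears in the later Nature version of DQN.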