Playing Atari With Deep Reinforcement Learning
http://arxiv.org/abs/1312.5602
One of the first papers from DeepMind. It introduced deep Q-networks (DQN), which became the state of the art in reinforcement learning.
Goal: play a game with only the pixels and rewards as input
Background: Reinforcement Learning
At each time step:
the agent chooses an action a_t from a fixed set of possible actions
the game state changes, and the agent only observes:
the new image x_t of the screen
the reward r_t, equal to the change in score (often 0)
We observe sequences of states and actions:
s_t = x_1, a_1, x_2, a_2, ..., a_{t-1}, x_t
And we learn game strategies over these sequences: given a sequence, we try to choose the next action to take (the interaction loop is sketched below).
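To make the setup concrete, here is a minimal sketch of this observation/action loop, assuming a hypothetical Gym-style env object whose step method returns the new screen image, the reward, and an end-of-episode flag (all names here are illustrative, not from the paper):

```python
# Minimal sketch of the agent-environment loop (hypothetical Gym-style API).
import random

def play_episode(env, n_actions, max_steps=10_000):
    frames, actions, rewards = [], [], []
    x = env.reset()                       # first screen image x_1
    for t in range(max_steps):
        a = random.randrange(n_actions)   # placeholder policy: random action a_t
        x, r, done = env.step(a)          # observe new image x_{t+1} and reward r_t
        frames.append(x)
        actions.append(a)
        rewards.append(r)                 # r_t is the change in score (often 0)
        if done:
            break
    return frames, actions, rewards
```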
Q-Learning
For each pair (sequence, action) (s, a), we try to determine its Q-value. The Q-value Q*(s, a) is defined as the maximum expected discounted return achievable from (s, a) over all possible policies:
Q*(s, a) = max_π E[ R_t | s_t = s, a_t = a, π ], where R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ...
The Bellman equation states that if we know the optimal values Q*(s', a') of the next sequence s' for every action a', we can deduce the optimal value of the current sequence:
Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') ]
One easy algorithm would be to iterate this update: Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') ]. But this tabular approach ignores the redundancies among the pairs (s, a): for each new pair (s, a), we have to learn its Q-value from scratch, with no generalization across similar sequences.
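As a contrast with the network approach below, here is a minimal tabular sketch of that update rule; the learning rate ALPHA (an assumption, not in the text above) smooths the noisy sampled targets:

```python
# Tabular Q-learning sketch: one Q-value per (state, action) pair,
# learned from scratch with no generalization across states.
from collections import defaultdict

GAMMA = 0.99   # discount factor
ALPHA = 0.1    # learning rate (smooths the noisy sampled targets)

Q = defaultdict(float)  # Q[(s, a)], defaults to 0

def q_update(s, a, r, s_next, actions):
    """Move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```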
Q-Network
Instead of learning a separate Q-value for every pair (s, a), a Q-network predicts the Q-value from the input (s, a) using a set of weights θ:
Q(s, a; θ) ≈ Q*(s, a)
The Q-network can be a simple linear function approximator or a neural network. It is trained by minimizing a sequence of losses, one per iteration i:
L_i(θ_i) = E_{s,a ~ ρ(·)}[ (y_i − Q(s, a; θ_i))^2 ]
where y_i = E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i−1}) ] is the target for iteration i, computed from the parameters θ_{i−1} of the Q-network at the previous iteration.
ρ is the behavior distribution that gives the next action in each state s. Often, ρ is an ε-greedy strategy: it follows the greedy strategy with probability 1−ε and takes a random action with probability ε.
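A sketch of these two ingredients in PyTorch, assuming q_net holds the current parameters θ_i and q_net_prev a frozen copy of θ_{i−1}, and that both map a batch of states to one Q-value per action (the names and the done flag are illustrative assumptions):

```python
import random
import torch
import torch.nn.functional as F

GAMMA, EPSILON = 0.99, 0.1

def epsilon_greedy(q_net, state, n_actions):
    """Behavior distribution rho: random with probability epsilon, greedy otherwise."""
    if random.random() < EPSILON:
        return random.randrange(n_actions)
    with torch.no_grad():
        return q_net(state.unsqueeze(0)).argmax(dim=1).item()

def q_loss(q_net, q_net_prev, s, a, r, s_next, done):
    """Squared error between Q(s, a; theta_i) and the target y_i."""
    with torch.no_grad():  # target uses the previous parameters theta_{i-1}
        y = r + GAMMA * q_net_prev(s_next).max(dim=1).values * (1 - done)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q_sa, y)
```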
Details of the architecture
The deep-Q network is composed of:
input: the state representation
for instance, in Atari games it consists of the current frame and the 3 previously sampled frames (only every 4th frame is kept, so the input is the current image plus the 4th, 8th and 12th previous images)
model: usually a neural network
for Atari games, since the input is 4x84x84 (4 images of 84x84 grayscale pixels), the network is a convolutional neural network
the final layer is a fully connected linear layer with one output per possible action
output: the estimated Q-value of each possible action for the input state (sketched below)
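A sketch of that architecture in PyTorch, following the layer sizes reported in the paper (16 8x8 filters with stride 4, then 32 4x4 filters with stride 2, then a 256-unit hidden layer):

```python
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network: 4x84x84 stacked frames in, one Q-value per action out."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # -> 16 x 20 x 20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # -> 32 x 9 x 9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # linear output layer, one Q-value per action
        )

    def forward(self, x):
        return self.net(x)
```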
Algorithm
Replay memory
Each transition (s, a, r, s') is stored in the replay memory, which has a fixed maximum capacity N. When the memory is full, the oldest transitions are discarded.
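A minimal sketch of such a memory, using a bounded deque so the oldest transitions are dropped automatically (the class and method names are illustrative):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores the last N transitions (s, a, r, s', done)."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted when full

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```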
Batch updates
At each iteration, the algorithm samples a minibatch of transitions uniformly at random from the memory and performs a gradient descent update on the network.
Sampling from the memory breaks the correlation between consecutive transitions, which reduces the noise of the updates and improves convergence; each transition can also be reused in many weight updates.
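Putting the pieces together, a sketch of one such iteration, reusing the hypothetical ReplayMemory and q_loss from the earlier sketches:

```python
import torch

def train_step(memory, q_net, q_net_prev, optimizer, batch_size=32):
    """One gradient descent update on a minibatch sampled from the replay memory."""
    if len(memory) < batch_size:
        return  # not enough transitions stored yet
    states, actions, rewards, next_states, dones = zip(*memory.sample(batch_size))
    s = torch.stack(states)                         # (B, 4, 84, 84)
    a = torch.tensor(actions)                       # (B,)
    r = torch.tensor(rewards, dtype=torch.float32)  # (B,)
    s_next = torch.stack(next_states)
    done = torch.tensor(dones, dtype=torch.float32)
    loss = q_loss(q_net, q_net_prev, s, a, r, s_next, done)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```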