This is a simple DQN implementation that works on CartPole-v0 with rendered pixels as input. It extends PyTorch's official DQN tutorial (which does not actually work as published): https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html.
It is tuned specifically for CartPole-v0 and may fail on other tasks.
Because DQN involves randomness (random replay sampling and random network initialization), training is not deterministic; you may have to restart a few times to get a satisfactory result.
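For context, the agent does not see the 4-dimensional CartPole state vector; observations are built from the rendered frame, in the spirit of the original tutorial's `get_screen` helper. Below is a minimal sketch of that idea; the exact cropping, resizing, and color handling used in this repo may differ.

```python
import gym
import numpy as np
import torchvision.transforms as T

# Sketch only: turn the rendered frame into a small image tensor, roughly
# as in the PyTorch DQN tutorial's get_screen(). The crop region, output
# size, and grayscale conversion here are illustrative assumptions.
resize = T.Compose([
    T.ToPILImage(),
    T.Grayscale(),
    T.Resize(40, interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
])

def get_screen(env):
    # Classic gym API: render(mode='rgb_array') returns an HxWx3 uint8 array.
    screen = env.render(mode='rgb_array')
    h = screen.shape[0]
    # Keep roughly the horizontal band of the image that contains the cart.
    screen = np.ascontiguousarray(screen[int(h * 0.4):int(h * 0.8)], dtype=np.uint8)
    return resize(screen).unsqueeze(0)  # shape: (1, 1, H', W')

if __name__ == '__main__':
    env = gym.make('CartPole-v0')
    env.reset()
    print(get_screen(env).shape)
```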
The following three trials were run with the same set of hyperparameters:
Trial | Max reward | Max 100-episode mean | Total episodes | Solved after (episodes) |
---|---|---|---|---|
0 | 1600 | 220 | 5000 | 3000 |
1 | 900 | 160 | 5000 | - |
2 | 2500 | 500 | 10000 | 700 |
The last column gives the episode at which the 100-episode mean reward first reached 200; "-" means it never did within the run.
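In other words, "solved" means the rolling mean over the last 100 episode rewards hits 200. A small sketch of that bookkeeping (illustrative, not necessarily the exact code in this repo):

```python
from collections import deque
import numpy as np

# Illustrative tracker for the "100-mean" metric: mean reward over the
# last 100 episodes, with the run counted as solved once that mean
# reaches 200.
class SolvedTracker:
    def __init__(self, window=100, threshold=200.0):
        self.rewards = deque(maxlen=window)
        self.threshold = threshold
        self.solved_at = None  # episode at which the threshold was first reached

    def update(self, episode, episode_reward):
        self.rewards.append(episode_reward)
        mean = float(np.mean(self.rewards))
        if (self.solved_at is None
                and len(self.rewards) == self.rewards.maxlen
                and mean >= self.threshold):
            self.solved_at = episode
        return mean
```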
Training reward history over the last 300 episodes:

Hyperparameters (shared across all trials):

Parameter | Value |
---|---|
Learning rate | 3e-5 |
Target net update (steps) | 200 |
Batch size | 256 |
Gamma | 1 |
Memory size | 10000 |
Memory alpha | 0.6 |
Memory beta start | 0.4 |
Memory beta frames | 10000 |
Epsilon start | 1.0 |
Epsilon end | 0.01 |
Epsilon decay | 10 |
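The last six rows parameterize an epsilon-greedy exploration schedule and a prioritized replay buffer (alpha/beta). How this repo consumes them exactly is not shown here, but the conventional schedules behind such parameters look roughly like the sketch below; the formulas, the step/episode units, and the hard target-network update are assumptions based on common DQN practice and the tutorial this code extends.

```python
import math

# Assumed values from the table above; the exact formulas and whether the
# decay counter is in steps or episodes are repo-specific assumptions.
EPS_START, EPS_END, EPS_DECAY = 1.0, 0.01, 10
BETA_START, BETA_FRAMES = 0.4, 10000
TARGET_UPDATE = 200  # "Target net update (steps)"

def epsilon(t):
    # Tutorial-style exponential decay from EPS_START down to EPS_END.
    return EPS_END + (EPS_START - EPS_END) * math.exp(-t / EPS_DECAY)

def beta(frame):
    # Linear annealing of the prioritized-replay importance-sampling
    # exponent from BETA_START toward 1.0 over BETA_FRAMES frames
    # (alpha = 0.6 controls how strongly priorities skew sampling).
    return min(1.0, BETA_START + frame * (1.0 - BETA_START) / BETA_FRAMES)

def maybe_update_target(step, policy_net, target_net):
    # Hard update: copy the policy network's weights into the target
    # network every TARGET_UPDATE optimization steps.
    if step % TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())
```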