On CartPole-v1, DQN with the original hyperparameters achieves a return of only up to ~200, whereas with the tuned hyperparameters it reaches the maximum of 500 and is much more stable. I mainly changed the hyperparameters in the following ways:
increase the learning rate to $3\times 10^{-4}$, a widely used default for Adam;
decrease the target update interval (i.e., sync the target network more often), so the target does not drift too far from the current network;
increase the buffer size to store many more samples, which improves value-estimation accuracy and sample efficiency;
change the epsilon schedule, decreasing both epsilon_begin and epsilon_steps. CartPole does not need such aggressive exploration, and I suspect the original DQN underperforms because the small buffer stays filled with nearly random actions for many timesteps, a direct consequence of the slow epsilon decay.
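As a rough sketch, the tuned configuration and the linear epsilon annealing described above might look like the following. The concrete numbers here are illustrative assumptions, not the exact values used in the experiments:

```python
# Hypothetical tuned DQN hyperparameters for CartPole-v1.
# All values are illustrative assumptions, not the author's exact settings.
TUNED = {
    "learning_rate": 3e-4,       # Adam learning rate
    "target_update_steps": 100,  # shorter interval: sync target net more often
    "buffer_size": 50_000,       # larger replay buffer
    "epsilon_begin": 0.5,        # lower initial exploration
    "epsilon_end": 0.01,
    "epsilon_steps": 10_000,     # faster decay than the original schedule
}

def epsilon(step: int, begin: float, end: float, steps: int) -> float:
    """Linearly anneal epsilon from `begin` to `end` over `steps` steps,
    then hold it at `end`."""
    frac = min(step / steps, 1.0)
    return begin + frac * (end - begin)
```

With a schedule like this, the agent acts greedily with probability `1 - epsilon(step, ...)` at each environment step, so a smaller `epsilon_begin` and `epsilon_steps` mean the replay buffer fills with on-policy-like transitions much sooner.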