Closed 0xJchen closed 3 years ago
Hi,
I am sorry for the late reply.
It is mentioned in the DQN paper "Human-level control through deep reinforcement learning",
The behaviour policy during training was ε-greedy, with ε annealed linearly from 1.0 to 0.1 over the first million frames, and fixed at 0.1 thereafter.
In early training, the agent knows nothing about the environment and should explore more, so ε is high. As its capability improves, ε decreases accordingly.
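For reference, that annealing schedule can be sketched in a few lines. This is a minimal sketch, not the paper's code, and the function and parameter names are my own:

```python
def linear_epsilon(frame_idx, eps_start=1.0, eps_end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from eps_start to eps_end over anneal_frames,
    then hold it fixed at eps_end, as described in the DQN paper."""
    fraction = min(frame_idx / anneal_frames, 1.0)
    return eps_start + fraction * (eps_end - eps_start)
```

At frame 0 this returns 1.0 (pure exploration), halfway through annealing it returns 0.55, and after one million frames it stays at 0.1.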
Hope this reply helps :).
Got it. Thanks!
Thanks for the great work. One small question: is there any reference for varying epsilon during training?