google-deepmind / dqn_zoo

DQN Zoo is a collection of reference implementations of reinforcement learning agents developed at DeepMind based on the Deep Q-Network (DQN) agent.

DQN epsilon-greedy strategy #11

Closed · liziniu closed this issue 3 years ago

liziniu commented 3 years ago

Hello,

I noticed a small difference between this implementation and the original paper's description of epsilon-greedy exploration in DQN. Maybe my understanding is wrong.

In particular, the original paper says that the epsilon-greedy decay starts after 50K frames (rather than stacked frames) and settles at 0.1 after 1M frames. However, the current code seems to start the decay after 200K frames and end it after 4M frames.

My observation is based on lines 130-136 of the following file:

https://github.com/deepmind/dqn_zoo/blob/master/dqn_zoo/dqn/run_atari.py

Could you help explain this difference?

Thanks,
Ziniu

GeorgOstrovski commented 3 years ago

Hi Ziniu,

thanks for your interest in the code. There are a number of subtle distinctions here, so let me try to clarify things.

Unfortunately, the original Nature paper wasn't 100% clear about when the text referred to "environment frames" (of which the training run has 200M) and when to "agent steps" (of which the training run has 50M). The difference results from frame skipping / action repetition: the agent receives only every 4th frame from the environment, and each of its actions is repeated to the environment 4 times.

Note that this is distinct from frame stacking (!). Frame stacking happens on the agent side AFTER frame skipping, so if the environment emits frames [1, 2, 3, 4, ...], then the agent receives the frames [4, 8, 12, 16, 20, ...] and its 4-frame stacked observations are [4, 8, 12, 16], then [8, 12, 16, 20], etc.
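To make the indexing concrete, here is a minimal Python sketch (not code from the repo; the constants are just the standard Atari settings) of how frame skipping and frame stacking compose:

```python
# Sketch only: shows how action repeat (frame skip) and frame stacking
# interact. With an action repeat of 4, the agent observes only every
# 4th environment frame; the 4-frame stack is then a sliding window
# over those *already skipped* frames.

ACTION_REPEAT = 4  # Agent acts once per 4 environment frames.
STACK_SIZE = 4     # Number of skipped frames stacked into one observation.

env_frames = list(range(1, 25))  # Environment frames 1, 2, 3, ...

# Frames the agent actually receives after frame skipping.
agent_frames = env_frames[ACTION_REPEAT - 1 :: ACTION_REPEAT]
print(agent_frames)  # [4, 8, 12, 16, 20, 24]

# Sliding 4-frame stacks built on the agent side, AFTER skipping.
stacks = [
    agent_frames[i : i + STACK_SIZE]
    for i in range(len(agent_frames) - STACK_SIZE + 1)
]
print(stacks[0])  # [4, 8, 12, 16]
print(stacks[1])  # [8, 12, 16, 20]
```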

In fact, in the original DQN code, both the learning updates (batches sampled from replay and used for training) and the epsilon decay started at the same time: after 50K agent steps, which is the same as 200K environment frames. You can see this in the original Lua implementation here.

It is the same in this codebase, as you can see here. In DQN Zoo we tried to express everything in terms of environment frames to avoid this confusion, but our algorithm should be identical to the original.
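To see the equivalence numerically, here is a hedged sketch (illustrative function names, not the dqn_zoo API) of the same linear epsilon schedule written once in agent steps and once in environment frames. The numbers are the ones quoted in this thread: the decay starts at 50K agent steps (= 200K environment frames) and runs for 1M agent steps (= 4M environment frames).

```python
# Sketch only: with an action repeat of 4, multiplying both the start
# point and the decay duration of the schedule by 4 converts it from
# agent steps to environment frames without changing its behaviour.

ACTION_REPEAT = 4

def linear_epsilon(t: int, begin_t: int, decay_steps: int,
                   begin_value: float = 1.0, end_value: float = 0.1) -> float:
    """Epsilon held at begin_value until begin_t, then decayed linearly
    to end_value over decay_steps, and held there afterwards."""
    frac = min(max((t - begin_t) / decay_steps, 0.0), 1.0)
    return begin_value + frac * (end_value - begin_value)

# In agent steps (as in the original Lua code): start at 50K, decay over 1M.
eps_steps = lambda step: linear_epsilon(step, 50_000, 1_000_000)

# In environment frames (as here): start at 200K, decay over 4M.
eps_frames = lambda frame: linear_epsilon(frame, 200_000, 4_000_000)

# The two agree at corresponding points, e.g. 500K agent steps == 2M frames.
assert abs(eps_steps(500_000) - eps_frames(500_000 * ACTION_REPEAT)) < 1e-9
```

Since the decay fraction is linear in `t`, scaling both `begin_t` and `decay_steps` by the action repeat leaves epsilon unchanged at every corresponding point in training.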

Hope this helps,
Georg

liziniu commented 3 years ago

Hi Georg,

Thanks for your clarification. It helps a lot!

Ziniu