Closed MasterScrat closed 4 years ago
Hi, thanks for bringing this up, we'll be posting a new set of results soon.
There is probably a bug in the evaluation of DQN-like methods: Google's Dopamine DQN evaluates for a fixed number of steps (125000), see https://github.com/google/dopamine/blob/master/dopamine/agents/dqn/configs/dqn_nature.gin, while SLM-Lab only evaluates once for each parallel env (I think?), as can be seen in the `gen_return` function of the analysis file. This is a much noisier estimate, especially for Breakout: sometimes the agent scores around 20-30, other times 200-300.
Please see the latest benchmark page with the full Atari results: https://github.com/kengz/SLM-Lab/blob/master/BENCHMARK.md. However, this benchmark was run with both smaller networks and smaller replay memories, and the performance for Breakout is lower. There is some variability in the results: some environments perform better than, equal to, or worse than their originally reported scores.
Describe the bug
The Breakout DDQN+PER benchmark is surprisingly low, with a maximum score under 150 and a final score under 80. The original paper shows final performance between 320 and 400 for this environment (although it was evaluated on a single seed).
To Reproduce
Additional context
Note that Breakout really shouldn't be problematic to solve. I am a bit worried by the shape of the training graph: https://user-images.githubusercontent.com/8209263/62100441-9ba13900-b246-11e9-9373-95c6063915ab.png - I am not yet familiar with this codebase, but I would suspect a bug.