Closed MasterScrat closed 4 years ago
Hi, thanks for bringing this up, we'll be posting a new set of results soon.
There is probably a bug in the evaluation of DQN-like methods: Google's Dopamine DQN evaluates for a fixed number of steps (125000), see https://github.com/google/dopamine/blob/master/dopamine/agents/dqn/configs/dqn_nature.gin, while SLM-Lab only evaluates once for each parallel env (I think?), as can be seen in the `gen_return` function of the analysis file. This is a much noisier estimate, especially for Breakout: sometimes the agent scores around 20-30, other times 200-300.
Please see the latest benchmark page with the full Atari results: https://github.com/kengz/SLM-Lab/blob/master/BENCHMARK.md. However, this benchmark was run with both smaller networks and smaller replay memories, and the performance for Breakout is lower. There is some variability in the results: some environments perform better than, equal to, or worse than their originally reported scores.
Describe the bug
The Breakout DDQN+PER benchmark is surprisingly low, with a maximum score under 150 and a final score under 80. The original paper shows final performance between 320 and 400 for this environment (although it was evaluated on a single seed).
To Reproduce
Additional context
Note that Breakout really shouldn't be problematic to solve. I am a bit worried by the shape of the training graph: https://user-images.githubusercontent.com/8209263/62100441-9ba13900-b246-11e9-9373-95c6063915ab.png - I am not yet familiar with this codebase, but I would suspect a bug.