openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

Cannot reproduce Breakout benchmark using Double DQN #176

Open gbg141 opened 6 years ago

gbg141 commented 6 years ago

I haven't been able to reproduce the results of the Breakout benchmark with Double DQN when using hyperparameter values similar to the ones presented in the original paper. After more than 20M observed frames (~100,000 episodes), the mean 100-episode reward still remains around 10, having reached a maximum of 12.

Below are the neural network configuration and the hyperparameter values I'm using, in case I'm missing something important or getting it wrong:

import gym

from baselines import deepq
# wrapper import path may differ between baselines versions
from baselines.common.atari_wrappers_deprecated import wrap_dqn, ScaledFloatFrame

env = gym.make("BreakoutNoFrameskip-v4")
env = ScaledFloatFrame(wrap_dqn(env))
model = deepq.models.cnn_to_mlp(
        convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
        hiddens=[512],
        dueling=False
)
act = deepq.learn(
        env,
        q_func=model,
        lr=25e-5,
        max_timesteps=200000000,
        buffer_size=100000,  # cannot store 1M frames as the paper suggests
        exploration_fraction=1000000/float(200000000),  # so as to finish annealing after 1M steps
        exploration_final_eps=0.1,
        train_freq=4,
        batch_size=32,
        learning_starts=50000,
        target_network_update_freq=10000,
        gamma=0.99,
        prioritized_replay=False
)

Does anyone have an idea of what is going wrong? The analogous results shown in a Jupyter notebook in openai/baselines-results indicate that I should be able to get much better scores.

Thanks in advance.

asimmunawar commented 6 years ago

Same here, I also get similar average rewards. I also ran the deepq/experiments/run_atari.py example without any modifications, and it still just converges to around 11 in about 5 million steps. Any help or suggestions would be appreciated.

candytalking commented 6 years ago

I observe the same problem when training with the "learn" function in "simple.py", which is what "run_atari.py" uses. When training with "deepq/experiments/atari/train.py" instead, it works fine.

BNSneha commented 6 years ago

File "train.py", line 244, in start_time, start_steps = time.time(), info['steps'] KeyError: 'steps' How to get rid of this error when trying to run atari/train.py?

btaba commented 6 years ago

I have been running train.py in baselines/deepq/experiments/atari/train.py with the command python train.py --env BeamRider --save-dir 'savedir-dueling' --dueling --prioritized, and I also cannot reproduce the results for BeamRider reported in the Jupyter notebook (even though it seems the train.py script was used to create those benchmarks). I had to make minor corrections to run the script due to the issues referenced in the comment directly above and in this ticket. I'm effectively running this lightly modified version.

btaba commented 6 years ago

@gbg141 Part of your issue might be that the rewards from the environment wrapped with wrap_deepmind are by default clipped to -1, 1 using np.sign, so the reported rewards in deepq/experiments/atari/train.py are clipped. If you turn reward clipping off and explicitly save the clipped reward in the replay buffer for training, that might work for you.
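A minimal sketch of that idea (not baselines' actual code; store_transition is a hypothetical helper): keep the environment's true rewards for logging, and clip only the copy that goes into the replay buffer used for training.

import numpy as np

episode_rewards = [0.0]

def store_transition(replay_buffer, obs, action, reward, new_obs, done):
    # log the unclipped reward so reported scores match the real game score
    episode_rewards[-1] += reward
    if done:
        episode_rewards.append(0.0)
    # train on the clipped reward, as in the DQN paper
    replay_buffer.add(obs, action, np.sign(reward), new_obs, float(done))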

kdu4108 commented 6 years ago

@btaba Did you try this/did it work for you?

btaba commented 6 years ago

@kdu4108 That actually didn't work for me. I also tried a git reset --hard 1f3c3e33e7891cb3 and wasn't able to reproduce the results in this notebook for Breakout.

Edit: I trained on commit 1f3c3e33e7891cb3 using python train.py --env Breakout --target-update-freq 10000 --learning-freq 4 --prioritized --dueling for 50M frames, and I am only able to reach a reward of ~250 as opposed to ~400.

kdu4108 commented 6 years ago

@btaba Okay thanks for the response. I tried training the default Pong using that version and successfully reproduced their results. Out of curiosity, have you tried to reproduce results on any other environments using that commit? Or, have you tested any later commits that might hold fixes for the Breakout reward difference?

btaba commented 6 years ago

@kdu4108 I only tried that commit on Breakout and BeamRider, and was not able to reproduce the results.

AshishMehtaIO commented 6 years ago

I'm facing the same issue. The only major difference between the DQN paper and the baselines implementation is the optimizer (RMSProp vs. Adam). Does the choice of one or the other make a significant difference?
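For reference, a sketch (TF1-style, not baselines' code) of the RMSProp configuration from the Nature DQN paper that one could pass to the training graph in place of the Adam optimizer baselines uses; the mapping of the paper's hyperparameters onto tf.train.RMSPropOptimizer is an approximation.

import tensorflow as tf

# RMSProp hyperparameters approximated from Mnih et al. (2015);
# baselines' deepq builds its train op with tf.train.AdamOptimizer instead.
rmsprop = tf.train.RMSPropOptimizer(
    learning_rate=25e-5,  # 0.00025
    decay=0.95,           # squared-gradient momentum
    momentum=0.95,        # gradient momentum
    epsilon=0.01,         # min squared gradient
)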

btaba commented 6 years ago

@ashishm-io You can try. Don't forget to log the actual episode rewards and not the clipped ones.

I find this DQN implementation to actually work. It's probably easier from there to add double-q and dueling networks.
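If it helps anyone making that change, the only difference between vanilla DQN and double-Q is how the bootstrap target is computed. A minimal NumPy sketch (function and argument names are illustrative; the Q-value arrays have shape (batch, num_actions)):

import numpy as np

def dqn_targets(rewards, dones, gamma, q_target_next):
    # vanilla DQN: the target network both selects and evaluates the next action
    return rewards + gamma * (1.0 - dones) * q_target_next.max(axis=1)

def double_dqn_targets(rewards, dones, gamma, q_online_next, q_target_next):
    # Double DQN (van Hasselt et al., 2015): the online network selects the
    # action, the target network evaluates it, reducing overestimation bias
    best_actions = q_online_next.argmax(axis=1)
    q_best = q_target_next[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - dones) * q_best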

AshishMehtaIO commented 6 years ago

Shouldn't Baselines log both the clipped and the true episode rewards by default? Isn't that essential for comparing results with other implementations?

benbotto commented 6 years ago

@ashishm-io Another difference is the size of the replay buffer. You might try bumping that to 1e6, because by default it's only 1e4. Note that in run_atari.py the ScaledFloatFrame wrapper is used, so 32-bit floats are used to store observations rather than 8-bit ints. In other words, you'll need a ton of memory!
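To put rough numbers on "a ton of memory" (a back-of-the-envelope estimate assuming the usual 84x84x4 stacked-frame observations, and ignoring any frame sharing the buffer may do):

obs_bytes = 84 * 84 * 4                                    # ~28 KB per uint8 observation
print("uint8:   ~%.0f GB" % (obs_bytes * 1e6 / 1e9))       # ~28 GB for a 1M-observation buffer
print("float32: ~%.0f GB" % (obs_bytes * 4 * 1e6 / 1e9))   # ~113 GB with ScaledFloatFrame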

@kdu4108 Yea, but Pong is the simplest of the Atari games as far as I know. In my implementation I achieve an average of over 20 in about 3 million frames. Breakout is significantly harder.

@btaba When you achieved the 250 average, that's the actual score, right? As opposed to the clipped score? And also, is that with or without episodic life? In other words, is that an average of 250 in one life, or in 5 lives?

OpenAI team: How do we reproduce what's reported in the baselines-results repository (https://github.com/openai/baselines-results/blob/master/dqn_results.ipynb)? It shows average scores of 400+; however, it references files that no longer exist, like wang2015_eval.py. I'm using the run_atari.py script, with dueling off but otherwise default, and getting an average of just over 18 after 10M frames (the default). I'm trying to implement DQN, but most of the code I find online has subtle bugs. It's important to have something out there to reference that has reproducible results!

ppwwyyxx commented 6 years ago

@benbotto The implementation I open-sourced two years ago (https://github.com/ppwwyyxx/tensorpack/tree/master/examples/DeepQNetwork) can reproduce a 400+ average score on Breakout within 10 hours on one GTX 1080 Ti.

benbotto commented 6 years ago

Thank you @ppwwyyxx, I'll definitely run your implementation and compare the results against my own. I'm able to reproduce the 400 score in my own vanilla DQN code as well, but I'm running into trouble with Prioritized Experience Replay. This is the only implementation I know of that uses PER and takes the importance-sampling weights into account; most forgo that last part. I've found this implementation, which does not correctly normalize the weights, and this one, which ignores the IS weights altogether. The Baselines implementation looks right to me, aside from a minor off-by-one bug that's awaiting a pull. That said, it would be nice to be able to reliably reproduce the numbers reported in the baselines-results repository!
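For anyone comparing PER implementations, the importance-sampling correction in question (Schaul et al., 2016) weights each sample's TD error by w_i = (N * P(i))^(-beta), normalized by the largest weight. A minimal NumPy sketch, with names chosen for illustration:

import numpy as np

def prioritized_is_weights(all_priorities, batch_indices, beta):
    # P(i): sampling probability proportional to priority (alpha applied upstream)
    probs = all_priorities / all_priorities.sum()
    n = len(all_priorities)
    # w_i = (N * P(i))^(-beta), divided by the maximum possible weight so the
    # correction only ever scales updates down
    max_weight = (n * probs.min()) ** (-beta)
    return (n * probs[batch_indices]) ** (-beta) / max_weight

# usage: weight the squared TD errors of a sampled minibatch
# loss = np.mean(prioritized_is_weights(priorities, idx, beta) * td_errors ** 2)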