google / dopamine

Dopamine is a research framework for fast prototyping of reinforcement learning algorithms.
https://github.com/google/dopamine
Apache License 2.0

Reproducing the scores reported by the IQN paper #37

Open muupan opened 5 years ago

muupan commented 5 years ago

Thank you for open-sourcing such great code!

I have questions about your IQN implementation, especially on how it can reproduce the scores reported by the paper.

First, your config file https://github.com/google/dopamine/blob/master/dopamine/agents/implicit_quantile/configs/implicit_quantile_icml.gin specifies N=N'=64. How did you choose these values?

Second, can the IQN implementation reproduce the scores reported by the paper? I ran it by myself against six games, but the results do not match the paper.

I used this command:

python3 -um dopamine.atari.train '--agent_name=implicit_quantile' '--base_dir=results' '--gin_files=dopamine/agents/implicit_quantile/configs/implicit_quantile_icml.gin' '--gin_bindings=AtariPreprocessing.terminal_on_life_loss=True' "--gin_bindings=Runner.game_name='Breakout'"

Here is the tensorboard plot I got:

[tensorboard plot]

Figure 7 of the IQN paper reports raw scores of 342,016 for Asterix, 42,776 for Beam Rider, 734 for Breakout, 25,750 for Q*Bert, 30,140 for Seaquest, and 28,888 for Space Invaders. Have you successfully reproduced scores at that level? If yes, how? If no, are you aware of any differences in implementation or settings compared to DeepMind's?

psc-g commented 5 years ago

hi, thank you for your support and reporting this! it turns out there was a very subtle bug in the IQN implementation that was recently fixed here: https://github.com/google/dopamine/commit/2de70a421ad8615fedf2e215a274467c72232347 with that bug fix, we were able to reproduce the published results.

could you verify if you are running with that bug fix?

we are currently re-running the baseline results for IQN on all games and will be releasing these once they are done (should be by next week). i'll add a note in the "What's New" section when this is done.

muupan commented 5 years ago

Thank you. The plot I pasted above is before that commit. I'll try again with the current master branch.

What about my first question on where these values come from?

ImplicitQuantileAgent.num_tau_samples = 64
ImplicitQuantileAgent.num_tau_prime_samples = 64

psc-g commented 5 years ago

see the line right before equation (4) in the paper (n=64): https://arxiv.org/pdf/1806.06923.pdf

muupan commented 5 years ago

I mean N and N' in (3), not n in (4). ImplicitQuantileAgent.num_tau_samples and ImplicitQuantileAgent.num_tau_prime_samples correspond to N and N', respectively, correct?
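
For reference, my reading of equation (3) is roughly the sample-based loss below (sketched from the paper, not a verbatim copy), which is where N and N' appear:

$$
\mathcal{L}(x_t, a_t, r_t, x_{t+1}) = \frac{1}{N'} \sum_{i=1}^{N} \sum_{j=1}^{N'} \rho^{\kappa}_{\tau_i}\!\left(\delta_t^{\tau_i, \tau'_j}\right),
\qquad
\delta_t^{\tau_i, \tau'_j} = r_t + \gamma Z_{\tau'_j}\!\left(x_{t+1}, \pi_\beta(x_{t+1})\right) - Z_{\tau_i}(x_t, a_t)
$$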

psc-g commented 5 years ago

ah yes, you're right. but look in the second-to-last paragraph of the same page, where they discuss varying N and N' in {1, 8, 32, 64}. Figure 2 compares the results. Although it suggests performance doesn't change much past N' = 8, we decided to use the larger of the explored values.

muupan commented 5 years ago

I see. Thank you for clarifying it.

muupan commented 5 years ago

Does dopamine follow the up-to-30-noop evaluation protocol used in the paper? I cannot find code that sends noop actions after reset.
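
To be concrete, what I mean is something like the following sketch (my own illustration of the protocol, not Dopamine code; the names are made up):

```python
import random

NOOP_ACTION = 0  # in the ALE, action 0 is the no-op

def reset_with_random_noops(env, max_noops=30):
    """Reset, then execute a random number (1..max_noops) of no-op actions."""
    observation = env.reset()
    for _ in range(random.randint(1, max_noops)):
        observation, _, done, _ = env.step(NOOP_ACTION)
        if done:  # very unlikely, but reset again just in case
            observation = env.reset()
    return observation
```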

psc-g commented 5 years ago

hi, we don't follow the up-to-30-noop evaluation protocol in dopamine. we have chosen to follow the recommendations in Machado et al., 2018 (https://arxiv.org/abs/1709.06009), which does not include that.

muupan commented 5 years ago

Because sticky actions are disabled by the config file and up-to-30-noop is not implemented, I suspect that the current evaluation protocol of implicit_quantile_icml.gin is more deterministic and thus easier than that of the IQN paper. Have you compared the scores of dopamine with and without up-to-30-noop?

muupan commented 5 years ago

I ran implicit_quantile_icml.gin again with the fix. The results are much better now, but on Breakout and Seaquest they haven't reached the paper scores yet. Any ideas?

[three tensorboard screenshots, 2018-10-22]

muupan commented 5 years ago

BTW, I sent Will Dabney an email a month ago asking about the values of N and N', but I still don't have a reply. Does anyone know the values?

mgbellemare commented 5 years ago

Hi, thanks for the thorough look at IQN! Hopefully by now you received the answer, but N and N' should be as in the implicit_quantile_icml.gin file.

mgbellemare commented 5 years ago

Also, out of curiosity -- did you figure out what was wrong?

muupan commented 5 years ago

> N and N' should be as in the implicit_quantile_icml.gin file.

Thank you for the information!

> Also, out of curiosity -- did you figure out what was wrong?

I haven't figured it out. The only difference I'm aware of between dopamine's IQN and the paper's is the 30-noop protocol, but I don't know how it affects scores. I would really appreciate it if you could share any other differences you are aware of.

muupan commented 5 years ago

I got a reply from Georg Ostrovski and confirmed that N=N'=64. He said the weight initialization was as below:

muupan commented 5 years ago

Dopamine uses 2D convolutions with padding=SAME, which makes the number of activations after the three convolutions be 11*11*64=7744, but it should be padding=VALID, thus 7*7*64=3136 (confirmed by Georg Ostrovski).
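
To double-check the arithmetic, here is a quick back-of-the-envelope calculation using the standard conv output-size formulas (this is just an illustration, not Dopamine code):

```python
import math

def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a conv layer for SAME vs VALID padding."""
    if padding == 'SAME':
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1  # VALID

for padding in ('SAME', 'VALID'):
    s = 84  # 84x84 input frames
    for kernel, stride in [(8, 4), (4, 2), (3, 1)]:  # the three DQN conv layers
        s = conv_out_size(s, kernel, stride, padding)
    print(padding, s, s * s * 64)  # SAME: 11, 7744; VALID: 7, 3136
```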

muupan commented 5 years ago

I tried padding=VALID for the same set of games by changing these lines:

--- a/dopamine/agents/implicit_quantile/implicit_quantile_agent.py
+++ b/dopamine/agents/implicit_quantile/implicit_quantile_agent.py
@@ -121,13 +121,16 @@ class ImplicitQuantileAgent(rainbow_agent.RainbowAgent):
     state_net = tf.div(state_net, 255.)
     state_net = slim.conv2d(
         state_net, 32, [8, 8], stride=4,
-        weights_initializer=weights_initializer)
+        weights_initializer=weights_initializer,
+        padding='VALID')
     state_net = slim.conv2d(
         state_net, 64, [4, 4], stride=2,
-        weights_initializer=weights_initializer)
+        weights_initializer=weights_initializer,
+        padding='VALID')
     state_net = slim.conv2d(
         state_net, 64, [3, 3], stride=1,
-        weights_initializer=weights_initializer)
+        weights_initializer=weights_initializer,
+        padding='VALID')
     state_net = slim.flatten(state_net)
     state_net_size = state_net.get_shape().as_list()[-1]
     state_net_tiled = tf.tile(state_net, [num_quantiles, 1])

The results seem slightly worse than padding=SAME. Note that the score reported by the paper is 42,776 for Beam Rider.

[tensorboard screenshot, 2019-01-07]

mgbellemare commented 5 years ago

Hi,

Very interesting! So it seems like it makes a small but noticeable difference, possibly due to the way Adam handles step size adaptation. What do you think?

psc-g commented 5 years ago

It would be interesting to average this over multiple runs and make sure it's a statistically significant difference. I can try running this after the ICML deadline, unless you beat me to it, Yasuhiro :).

muupan commented 5 years ago

FYI, I have pasted the plots of SAME vs VALID for the six games here, although they are all single runs. https://docs.google.com/document/d/1fsYzmNhfLvtPP4Cm-dbtp_MviH5WLo8qeiUJYMgRVio/edit?usp=sharing

muupan commented 5 years ago

> possibly due to the way Adam handles step size adaptation

Could you elaborate on this?

cathera commented 5 years ago

Hi, I am also trying to reproduce the scores of IQN, but according to @psc-g dopamine doesn't follow the 30-noop protocol. Does this mean that if I want to use C51, QR-DQN, and IQN as baselines, I will have to redo all the experiments instead of using the scores reported in their papers?

I don't have a lot of resources, so it is not likely that I can finish all of them before the NeurIPS deadline. So I am wondering, @muupan, have you figured out the effect of 30-noop yet? Were there significant differences between runs with and without 30-noop?

muupan commented 5 years ago

@cathera I haven't checked differences between with and without 30-noop.

mgbellemare commented 5 years ago

The 30 no-ops are a bit of a hack, and they don't have a big impact on performance. They were designed to discourage open-loop policies like The Brute (discussed in the Machado et al., 2018 paper). With sticky actions (same paper), the 30 no-ops become less relevant.
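
For readers unfamiliar with it, sticky actions roughly amount to the wrapper sketched below; this is an illustrative sketch, not the actual ALE or Dopamine implementation:

```python
import random

class StickyActionEnv:
    """Illustrative sketch: with probability 0.25, repeat the previous action
    instead of the one the agent selected (Machado et al., 2018)."""

    def __init__(self, env, repeat_action_probability=0.25):
        self.env = env
        self.repeat_action_probability = repeat_action_probability
        self.prev_action = 0  # no-op

    def reset(self):
        self.prev_action = 0
        return self.env.reset()

    def step(self, action):
        if random.random() < self.repeat_action_probability:
            action = self.prev_action
        self.prev_action = action
        return self.env.step(action)
```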

hh0rva1h commented 1 year ago

@psc-g @mgbellemare So what is the status of this issue? The baseline scores of IQN, at least for Beam Rider, still don't match the paper. I was just about to open a duplicate issue, so please let me paste the text I had already written:

I have been especially interested in reproducing the Atari scores of distributional algorithms (QR-DQN / IQN) from the original papers / dqn_zoo. I have selected one particular game where the difference between QR-DQN and IQN should be very obvious: Beam Rider, see the following plot from dqn_zoo (https://github.com/deepmind/dqn_zoo/blob/master/plot_atari_individual.svg), where the green line corresponds to IQN, the red one to QR-DQN, and the gray one to Rainbow.

So far I have tried to exactly replicate the environment from dqn_zoo by registering the custom gym environments (https://github.com/deepmind/dqn_zoo/blob/master/dqn_zoo/gym_atari.py#L36), but I have not been successful in reproducing the IQN score (I can reach the exact same QR-DQN score with stable_baselines3's QRDQN and the same hyperparameters as in the paper). Also, looking at your baseline plots here: https://google.github.io/dopamine/baselines/atari/plots.html, dopamine's IQN is nowhere near the results of the paper / dqn_zoo's IQN for the Beam Rider game.

So I have a couple of questions:

  1. Why is there such a large discrepancy between the IQN scores of dopamine and dqn_zoo on the Beam Rider game? I failed to figure out which gym config you use for your benchmark here; it would be highly appreciated if you could point me to the source.
  2. Why are there such large discrepancies between the papers and the code repos in general? The IQN paper reports a score of 42,776 for Beam Rider, dqn_zoo's implementation gets between 20k and 30k but nowhere near over 40k, and dopamine's IQN settles around 7k. Both the IQN paper and the QR-DQN paper report over 30k on Beam Rider for QR-DQN, but in dqn_zoo QR-DQN gets stuck at around 10k (which I could confirm with different QR-DQN implementations).

Generally I have a hard time figuring out which scores I should rely on when benchmarking my own implementation (to verify its correctness). So far I have found dqn_zoo to be the most promising source, since its gym config is easy to replicate (compared to xitari from the original papers, which does not follow gym and is therefore hard to use with other implementations) and seems to be closer to the paper than the openai Atari configs (e.g. NoFrameskip-v4 versions paired with the common env wrappers, see https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py, modulo the FireReset wrapper, which DeepMind did not use according to https://github.com/openai/baselines/issues/240).

hh0rva1h commented 1 year ago

Never mind about my question to point me to the source of the gym config, I figured it out: https://github.com/google/dopamine/blob/a2753dae222c75ae991758d4110a84bc01c3215f/dopamine/discrete_domains/atari_lib.py#L70. So it seems that, contrary to dqn_zoo, you prefer the NoFrameskip-v4 version of the respective game (despite sticky actions being the default in the lib, you explicitly disable them in the gin files to match the paper). Still, I'd like to be able to reproduce the paper scores; help would be highly appreciated here.
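
For anyone else looking for it, the linked code boils down to roughly this (my paraphrase of the linked source, not a verbatim copy):

```python
import gym

def create_atari_environment(game_name, sticky_actions=False):
    # Paraphrased from dopamine/discrete_domains/atari_lib.py: v0 environments
    # use sticky actions, v4 environments are deterministic; frame skipping and
    # other preprocessing are handled by Dopamine's AtariPreprocessing wrapper.
    game_version = 'v0' if sticky_actions else 'v4'
    return gym.make('{}NoFrameskip-{}'.format(game_name, game_version))

env = create_atari_environment('BeamRider', sticky_actions=False)
```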

mgbellemare commented 1 year ago

Hi,

Does DQN Zoo use sticky actions? My understanding is no. The 7000 is with sticky actions, so that might explain the difference. (Maybe it's time someone ran a side-by-side comparison).

I can't explain the IQN paper vs DQN Zoo difference, although I would take DQN Zoo as the authoritative source.

Frame preprocessing (colour, cropping, etc.) can make differences that add up, unfortunately. If your agent is learning and achieving a score in the range of [Dopamine, DQN Zoo] I would assume you've mostly done things correctly. If you are trying to reproduce the algorithm perfectly - why not just use the DQN Zoo code?

mgbellemare commented 1 year ago

FWIW I've been told (but not verified) that DQN Zoo uses terminal on life loss and a larger eval epsilon, both of which would result in different performance than e.g. the Dopamine implementation.

hh0rva1h commented 1 year ago

Thank you very much @mgbellemare for sharing your thoughts and giving some important hints. Dopamine is indeed using a different config from the respective papers for the baseline plots (in the case of IQN, https://github.com/google/dopamine/blob/master/dopamine/agents/implicit_quantile/configs/implicit_quantile.gin instead of https://github.com/google/dopamine/blob/master/dopamine/agents/implicit_quantile/configs/implicit_quantile_icml.gin), which differs in the following ways:

  1. some hyper-parameters for the agent are changed to be comparable to the rainbow paper
  2. the sticky actions variant of the environment is used
  3. no terminal on life loss condition

From what I understand, the _icml.gin config is supposed to replicate the setup of the paper (it is also the config the OP used above), with the only difference being the 30-noop protocol, which should not make much of a difference according to https://github.com/google/dopamine/issues/37#issuecomment-489609875.

I compared the _icml.gin config to the respective config of dqn_zoo (https://github.com/deepmind/dqn_zoo/blob/master/dqn_zoo/iqn/run_atari.py#L47, lines 47 to 79) and they seem to be mostly identical; however, I noticed two differences:

  1. dqn_zoo has exploration_epsilon_decay_frame_fraction=0.02, which should correspond to 1 million frames, i.e. 250k agent steps (with a frame skip of 4), while implicit_quantile_icml.gin says RainbowAgent.epsilon_decay_period = 1000000 # agent steps. The papers talk about a 1-million-frame decay, so the gin file should actually use RainbowAgent.epsilon_decay_period = 250000 (the same applies to dqn_nature.gin, where this value also does not match the publication), shouldn't it?
  2. dqn_zoo says flags.DEFINE_integer('tau_samples_policy', 64, '') while dopamine uses ImplicitQuantileAgent.num_quantile_samples = 32, but from what I understand, this time dopamine is in accordance with the paper and dqn_zoo is not.

It would be really great to have your feedback here.