google-research / planet

Learning Latent Dynamics for Planning from Pixels
https://danijar.com/planet
Apache License 2.0

Intuition about hyperparameters for Atari games #21

Closed piojanu closed 4 years ago

piojanu commented 5 years ago

Hi!

I decided to put Sokoban aside for now, as it doesn't seem to work the same way as in World Models. So I started experimenting with Atari once again. I easily made World Models work with e.g. the Boxing environment, but in PlaNet I get very blurry open-loop predictions: image

Could you please share your intuition about which hyperparameters I might try to tune? I tried disabling overshooting (to make the setup more similar to World Models) and I'm now looking into the divergence scales, but without luck so far. As I have no cluster to run hyperparameter optimisation, maybe you have some ideas about what is worth focusing on? :D

Thanks!

ghost commented 5 years ago

At least for me, 50k train steps, 1.5k test steps, and collect_every 8k works for Assault-v0 with a 1.2k max-length bound. The agent learns somewhat decent behaviour (~300 score per episode) after 73 episodes of ~800 steps (I'll post more info after some longer training). It learns predictions as well; they are blurry, but decent. Note that Atari games already do action repeat, so I disabled it in PlaNet's source code.
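Spelled out, those overrides would look roughly like this (a sketch only; the parameter names below are the ones that appear later in this thread, so check planet/scripts/configs.py for the authoritative names and defaults):

    # Rough sketch of the overrides described above; names not verified
    # against the repo, max_length is an assumed name for the length bound.
    params = dict(
        train_steps=50000,    # 50k train
        test_steps=1500,      # 1.5k test
        collect_every=8000,   # collect a new episode every 8k steps
        max_length=1200,      # the "1.2k max length bound"
    )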

astronautas commented 5 years ago

@piojanu The predictions don't seem that blurry (I do see agents fighting). Take another look at the World Models paper - their predictions look blurry as well. I suggest focusing on the test performance to see whether the agent actually learns correct behaviour, which is the goal.

piojanu commented 5 years ago

Thanks guys, I'm still working on it. Nevertheless, the PlaNet continuous control tasks had really sharp reconstructions even after 15 steps. Furthermore, I've implemented World Models and trained it on Boxing. Here are the open-loop predictions I was able to obtain there: image

As you can see, those are MUCH clearer. Do you have other ideas?

astronautas commented 5 years ago

@piojanu Interesting, they do seem much clearer. That's a dimension to explore though, maybe World Models' Variational Auto Encoder is a bit better.

piojanu commented 5 years ago

Also: @astronautas the agent's performance is weak too. Please keep your responses more succinct. What do you mean by "maybe the VAE of World Models is a bit better"? @Kwander thank you for noting that OpenAI Gym already does action skipping. I'll disable it too!

piojanu commented 5 years ago

Closed by mistake, sorry.

Lowering the global prior divergence scale seems to help. I'll disable it and run the test again ;)

astronautas commented 5 years ago

@piojanu What do you mean by "... the performance is weak too"? I've edited my answer.

piojanu commented 5 years ago

The agent gets below 0 points on average. In World Models it was 18 points on average, but the planning algorithm is different, so it's not a fair comparison. Let's please focus on the prediction metrics (reconstruction in this case), as that's the main point of my question.

astronautas commented 5 years ago

What batch size are you using? Is it the same one as the author used?

piojanu commented 5 years ago

@astronautas that's a little funny, expanding VAE to Variational Auto Encoder doesn't add the value I was looking for 😄 I use the default batch size (50, 50).

astronautas commented 5 years ago

Sorry, the author did not use the term VAE in the paper at all, "variational encoder" was the term used :D I meant that the encoder from observations into the latent space is worth looking at.

piojanu commented 5 years ago

@Kwander do you observe this too:

image

Please note the scale of the y-axis. Those values are barely changing, e.g. graph/summaries/closedloop/posterior/log_prob/image stays pretty much the same the whole time. Is that expected? To me it's a sign that the model isn't learning much beyond the first steps, before the first summary gets written.

astronautas commented 5 years ago

@piojanu I have noticed this as well, for the dm_control cartpole swingup problem (Kwander was another temporary work account of mine, so you can talk to me directly :D).

piojanu commented 5 years ago

Okay, thanks for the clarification ;) Do you use the default config for "Assault-v0"? Could you post your open-loop reconstructions and the agent's score, so I know what to expect?

EDIT: Also please note that the reward log_prob is two orders of magnitude lower than the image log_prob. Isn't that a problem that should be counteracted with the reward scale? @danijar could you join the discussion?

danijar commented 5 years ago

@piojanu By default we use a decoder variance of 1, which means the model explains a lot of the variation in the image as random noise. While this leads to more robust representations, it also leads to blurrier images. If the predicted images are all the same, the posterior collapsed because the model explains everything as observation noise. Try to reduce the decoder variance in conv_ha.py or equivalently set a lower divergence_scale parameter.
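As a rough illustration (not the actual conv_ha.py code) of how a smaller decoder scale makes blurry reconstructions more expensive under the image log-likelihood:

    import numpy as np
    import tensorflow_probability as tfp

    tfd = tfp.distributions

    # Dummy decoder mean and target batch, just to illustrate the shapes.
    mean = np.zeros((8, 64, 64, 3), dtype=np.float32)
    target = np.random.uniform(-0.5, 0.5, size=(8, 64, 64, 3)).astype(np.float32)

    # With scale 1.0 (the default) pixel errors are cheap, so the model can
    # explain detail away as noise; a smaller scale (e.g. 0.1) penalizes
    # blurry reconstructions much more strongly.
    dist = tfd.Independent(tfd.Normal(mean, 0.1), reinterpreted_batch_ndims=3)
    log_prob = dist.log_prob(target)  # one log-likelihood value per image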

@astronautas That's a cool result. I assume you're still using CEM with a Gaussian belief over action sequences for planning on even though the Atari actions are discrete, and later take the argmax along the action dimension?

astronautas commented 5 years ago

@danijar Thanks, yes, I am using those. Instead of the argmax, I let PlaNet learn to predict one action between -1 and 1, which I then map to my Atari action space from 0.0 to action_space.n and floor to get the action value.

piojanu commented 5 years ago

Thanks guys. Today I'll run hyper-param tuning and leave it for ~3-4 days. I'll get back to you with results (and even more questions, haha 😊) in here so let's leave this issue open.

piojanu commented 5 years ago

Quick update: I'm working on the hyper-param tuning. I wasn't able to run it during the last weekend, but I did start it yesterday.

astronautas commented 5 years ago

I did several more runs on the Assault and Breakout Atari games as well! I'll update my comment in the evening with the results.

The algorithm seems to perform worst on Breakout, even though it's probably the simplest of these games for model-free algorithms (no early conclusions, I'll share the results first). It cannot predict the small ball's movement even after ~150 episodes.

piojanu commented 5 years ago

@danijar check me if I understand correctly what is happening.

  1. Consecutive frames in Boxing are quite similar, so the decoder explains the differences between them as random noise. This means the posterior starts to infer similar codes for consecutive frames (this is what you mean by "the posterior collapsed", right?).
  2. The transition model (prior) is trained to minimise the KL-divergence to the posterior. The posterior tells it that consecutive frames have (almost) the same codes, so the transition model doesn't learn anything and the open-loop predictions are more or less the same blurry blob.

Now, lowering the variance of the decoder should result in sharper reconstructions, because a blurry blob is less probable under the lower variance, i.e. the MSE loss is higher for blurry reconstructions and the posterior needs to encode concrete information about what is in the frame so the decoder can use it (i.e. it needs to stop encoding bull**** 😄).

The other way is to lower the divergence scale, so that one-, two-, three-... step predictions (which will have the same or very similar codes, as I explained above) have a lower impact than long-term predictions (which should differentiate the frames better, because frames further apart differ more). Am I right?

One thing that I don't understand is why you called the two approaches "equivalent":

Try to reduce the decoder variance in conv_ha.py or equivalently set a lower divergence_scale parameter.

Those seem to me to address the problem in fundamentally different ways, and maybe both changes need to be applied. Could you expand on why they are equivalent?

danijar commented 5 years ago

I suggest reducing the divergence_scale (I'd try values around 1e-2 or 1e-3) and increasing the action repeat. The action repeat results in a bigger difference between consecutive frames and thus more signal for the model to learn from, which cannot as easily be modeled as noise, as you explained.
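For reference, the environment-side version of action repeat could be a minimal gym wrapper like the sketch below (PlaNet applies its own repeat internally, so this is only an illustration of the idea, not the repo's mechanism):

    import gym

    class ActionRepeat(gym.Wrapper):
        """Repeat each action `amount` times and sum the rewards (a sketch)."""

        def __init__(self, env, amount=4):
            super().__init__(env)
            self.amount = amount

        def step(self, action):
            total_reward = 0.0
            for _ in range(self.amount):
                obs, reward, done, info = self.env.step(action)
                total_reward += reward
                if done:
                    break
            return obs, total_reward, done, info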

The divergence scale and the (constant, scalar) decoder variance are equivalent. You can see this by writing the ELBO for a Gaussian decoder in the standard form E_q(z)[ln p(x|z)] - KL[q(z) || p(z)]. The log-likelihood term is ln p(x|z) = -0.5 (x - f(z))^2 / std^2 - ln Z. Multiplying the ELBO by std^2 removes the variance from the log-prob term and puts it in front of the KL term, as in beta-VAE. The objectives have different values because of the Gaussian normalizer Z, but they share the same gradient, since the normalizer is a constant.
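Written out (a sketch of the same algebra, with a constant scalar standard deviation σ):

    \begin{align}
    \mathrm{ELBO} &= \mathbb{E}_{q(z)}\big[\ln p(x \mid z)\big] - \mathrm{KL}\big[q(z)\,\|\,p(z)\big] \\
    \ln p(x \mid z) &= -\tfrac{1}{2\sigma^2}\big(x - f(z)\big)^2 - \ln Z, \qquad Z = \sqrt{2\pi}\,\sigma \\
    \sigma^2 \cdot \mathrm{ELBO} &= -\tfrac{1}{2}\,\mathbb{E}_{q(z)}\big[(x - f(z))^2\big]
      - \sigma^2\,\mathrm{KL}\big[q(z)\,\|\,p(z)\big] - \sigma^2 \ln Z
    \end{align}

So σ² plays exactly the role of the β weight in beta-VAE (i.e. of divergence_scale), and the last term is a constant that doesn't affect the gradient.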

astronautas commented 5 years ago

So here are some results from training to play Breakout-v0:

Sadly, I accidentally deleted the log directory, thus I cannot share images. Here's what I recall:

Note: I am mapping the [-1, 1] interval to [0, action_space.n] and applying math.floor to get a discrete action. E.g. if the network predicts -0.1, it gets mapped to ~2.95, which gets floored to action 2.
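In code, that mapping would look roughly like this (a sketch; the exact scaling used above may differ slightly):

    import math

    def to_discrete(action, n_actions):
        """Map a continuous action in [-1, 1] to a discrete Atari action id."""
        x = (action + 1.0) / 2.0 * n_actions           # [-1, 1] -> [0, n_actions]
        return min(int(math.floor(x)), n_actions - 1)  # clamp the x == n_actions edge case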

Things yet to try:

astronautas commented 5 years ago

EDIT: I did another run with this configuration:

    params.batch_shape = [24, 50]
    params.train_steps = 40000
    params.test_steps = 1000
    params.collect_every = 20000
    divergence_scale = 1e-3

Results: graph/summaries/closedloop/posterior/image/prediction/image/0

The weirdest thing about this is that the predictions seem to only degrade over time.


graph/summaries/closedloop/posterior/image/animation/gif

Now, it seems that lowering the divergence scale and increasing the action repeat makes the predicted images more distinct. At least the ball is now predicted, although not in all predictions.

I have a simple question for you @danijar:

piojanu commented 5 years ago

@astronautas Here are the graphs from training a simple VAE on the MNIST dataset (I've implemented it in TensorFlow Probability using Danijar's coding patterns from his blog, to better understand the PlaNet summaries and codebase ~and take over his researcher identity~. I'm starting to feel like some creep/crazy fanboy 😮 @danijar do you or PlaNet have a fanpage? haha 😄):

image image image

So, let's get serious. My intuition is that the KL-divergence from the global prior (which is a normal distribution) should get higher as the posterior is trained to encode useful information. At the same time, the entropy of the posterior should get lower (it becomes more and more certain about the latent state (code) as it learns). On the other hand, the KL-divergence between the posterior and the prior from the transition model should get lower as the transition model learns to predict future latent states (codes) more and more accurately. The loss (negative ELBO) needs to get lower, and all the log probs need to get higher. @danijar please confirm that this holds for PlaNet. Also, I still don't understand whether it's correct that the summaries have such small ranges on the y-axis (see my comment above). @danijar also thank you for your explanation about beta-VAE, I hadn't caught that.

Also, @astronautas please post your openloop and closedloop/prior predictions, as those are interesting too (or even more interesting). closedloop/posterior is the 0-step prediction (encode and decode the frame), closedloop/prior is the 1-step future prediction (+ decode), and openloop is the n-step future prediction (+ decode). I'm starting to gather interesting results from my hyper-param tuning for Boxing. I'll post the results tomorrow or next week, after the Easter holidays.

astronautas commented 5 years ago

Here are some results from my latest run (divergence scale 1e-4, learning rate 0.5e-3):

Openloop: openloop

Closedloop posterior: closedlooprior

Closedloop prior: closedlooprior

Entropy: entropy

Correct me if I'm wrong, guys, but the posterior entropy should be decreasing and the log_probs should all be increasing, right?

Thanks!

danijar commented 5 years ago

The entropy and losses don't necessarily have to decrease during training, because the agent is collecting new data that could be surprising. I would mostly look at the open-loop image summaries and the predicted reward trajectories (another image summary).

@piojanu I don't have a fan page, unfortunately :D I'm not sure if your intuition regarding the KL is correct. The KL to the global prior may stay roughly constant throughout training.

astronautas commented 5 years ago

@danijar Good tip, here are the predicted reward screenshots:

rewards

It makes sense why the performance is so poor - the planner cannot successfully predict the reward spikes, and thus cannot choose good corresponding actions.

Question: RewardT (the reward at timestep t) is generated by the reward model from HiddenStateT, while the hidden state is generated by the transition model. Does the graph therefore mean that the transition model performs poorly? (i.e. the openloop summaries would validate this, right?) It could also be that the current state is inferred poorly from past observations, ruining the planning forward (the closedloop images would validate this, right?).

Concerning the openloop summaries, how should one interpret the images? Do they follow the same structure as in the paper:

openloop

danijar commented 5 years ago

Yes, the open loop images have 5 context frames and all other frames are predicted open loop. For the paper, we additionally only showed every 5th frame to have it fit on a page. The TensorBoard summaries don't skip any frames.

It seems like the reward predictions are not very accurate (yet). How long did this train for? I would expect it to take at least a couple of million steps. One million steps means 200 episodes since it collects an episode every 5000 steps.

astronautas commented 5 years ago

Thanks @danijar,

"... all other frames are predicted open loop " - you mean even rows are observations and odd rows are reconstructed observations from planned latent states? I am not sure I understood what the context frames signify though... :confused:

I might have missed this detail. What's the horizon when training to predict future states? Is it always config.batch_size.shape[1] (i.e. 50)? Yet the planner uses only 12 steps into the future due to performance limitations?


I trained it for about 6 hours and 800k steps (370 episodes).

The problem with these Atari games is that more than one episode is collected in a single collect session, since the "done" signal is only issued when the agent loses all its lives. I was thinking of writing a wrapper that resets the game when the agent runs out of lives to solve this, or of collecting data only until the first done is fired.
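For reference, a sketch of one common variant of such a wrapper (the episodic-life trick used by many model-free Atari agents, not anything from the PlaNet repo), which reports done whenever a life is lost so that each life becomes its own episode:

    import gym

    class EpisodicLife(gym.Wrapper):
        """Report `done` when a life is lost (a sketch; assumes an ALE-based
        env that exposes env.unwrapped.ale.lives(), as gym Atari envs do)."""

        def __init__(self, env):
            super().__init__(env)
            self.lives = 0

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            lives = self.env.unwrapped.ale.lives()
            if 0 < lives < self.lives:  # lost a life but the game continues
                done = True
            self.lives = lives
            return obs, reward, done, info

        def reset(self, **kwargs):
            obs = self.env.reset(**kwargs)
            self.lives = self.env.unwrapped.ale.lives()
            return obs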

piojanu commented 5 years ago

@danijar could you share your insight about KL-divergence to global prior then? @astronautas I think episodes are reset after "done", it's here: https://github.com/google-research/planet/blob/9cd9abed5b9a8831388f4d9da16e5604cfbd7c20/planet/control/simulate.py#L225-L232 @danijar could you confirm that?

astronautas commented 5 years ago

@danijar @piojanu I have some new Breakout results after fiddling with the divergence_scale.

With a high divergence scale, the 0-step reconstructions of the ball fail. With the scale reduced to 1e-4, it mostly succeeds at the 0-step reconstruction, though it hallucinates incorrect future ball positions or infers the ball's direction of movement incorrectly. Not always, as you can see, yet it's not good enough (the score is ~4).

Sometimes it does succeed. In the last row of the image below, in the 6th frame, it confidently chooses a corrective action. Still, it is kind of "shy" about predicting the ball's movement.

Without solving this, I doubt it can infer realistic rewards leading to good actions. The problem I sense with these games is that the reconstructions need to be very accurate, as the dynamics are fast.

I'll check out how model-free algos do on this task and verify whether we're expecting too much of PlaNet.

openloop

danijar commented 5 years ago

The fact that the ball is only a few pixels large might be a problem: with a loss that treats all pixels equally, there is little motivation for the model to learn to predict it -- it's much easier to get the loss down by getting really good at predicting the background. Lowering the divergence scale should help a bit, as you said. You can also try to reduce the color depth further, e.g. to image_bits: 3. At the same time, it might be easier to start with an environment that doesn't have important but tiny objects.
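A sketch of what reducing the color depth means (the repo's exact image_bits preprocessing may differ in details such as dithering or the output range):

    import numpy as np

    def reduce_color_depth(frame, bits=3):
        """Quantize a uint8 RGB frame (0..255) to `bits` bits per channel and
        scale it to roughly [-0.5, 0.5] as a model input (a sketch)."""
        levels = 2 ** bits
        quantized = np.floor(frame.astype(np.float32) / 256.0 * levels)  # 0..levels-1
        return (quantized + 0.5) / levels - 0.5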

astronautas commented 5 years ago

Thanks @danijar for the tips. I am still eager to maximize PlaNet's accuracy on Breakout - lots of games contain small yet important details that should not be ignored.

I am intrigued by the World Models open-loop results @piojanu shared. They probably suggest that it's worth fiddling with the loss functions as well. Thanks for keeping this discussion alive. Hopefully it will give you some insights as well into where PlaNet performs well and what it still lacks.

2 short questions:

  1. What's the difference between the global_divergence_scale and divergence_scale hyperparameters?
  2. Prior and posterior terminology in the code (I got a bit confused, as the terms differ from the paper): for example, take this overshooting.py code snippet:

    (prior, posterior), _ = tf.nn.dynamic_rnn(
        cell, (embedded, prev_action, use_obs), length,
        dtype=tf.float32, swap_memory=True)

In this piece of code, posterior is the h_t+1 and prior is s_t, yes? While "cell" is h_t, prev_action is a_t and embedded is o_t?

danijar commented 5 years ago

@astronautas The global_divergence_scale is for the fixed global prior and the divergence_scale is for the temporal prior predicted by the model. The prior and posterior are the same as described in the paper, prior means p(s_t | s_t-1, a_t-1) and posterior means q(s_t | s_t-1, a_t-1, o_t). The state s_t is a dict that contains both the deterministic GRU state and the stochastic state.
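Unpacking that state dict, this corresponds to the RSSM structure from the paper (restated here for clarity):

    \begin{align}
    \text{deterministic state:} \quad & h_t = f(h_{t-1}, s_{t-1}, a_{t-1}) \\
    \text{prior (transition model):} \quad & s_t \sim p(s_t \mid h_t) \\
    \text{posterior (conditioned on the observation):} \quad & s_t \sim q(s_t \mid h_t, o_t)
    \end{align}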

I've found that I can train on the MuJoCo tasks without overshooting, global prior, and reward scale, by setting future_rnn: True and switching the activation functions from tf.nn.relu to tf.nn.elu. We'll post an update to the paper at some point but if you like you could already try and see if this works better on Atari as well. Another thing to try to help with small objects is to set free_bits to a larger value.

astronautas commented 5 years ago

@danijar Thanks! If we disable overshooting though, does the loss still include the training of priors against posteriors, i.e. KL(p(s_t | s_t-1, a_t-1) ; q(s_t | s_t-1, a_t-1, o_t))? It looks like this is disabled when we disable the latent overshooting loss, though correct me if I am wrong.

danijar commented 5 years ago

There will still be a KL term from each posterior to the temporal prior predicted from one step before. This is the term from the standard ELBO, it is the only term left when setting overshooting: 0, and its scale is controlled by divergence_scale.
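Schematically, the remaining objective is the standard one-step ELBO term per time step (my notation rather than the paper's exact equation, with β standing for divergence_scale; the reward log-likelihood enters the same way as the observation term):

    \sum_t \Big( \mathbb{E}_{q(s_t \mid o_{\le t}, a_{<t})}\big[\ln p(o_t \mid s_t)\big]
      \;-\; \beta \, \mathbb{E}_{q(s_{t-1} \mid o_{\le t-1}, a_{<t-1})}\,
      \mathrm{KL}\big[\, q(s_t \mid o_{\le t}, a_{<t}) \,\big\|\, p(s_t \mid s_{t-1}, a_{t-1}) \,\big] \Big)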

astronautas commented 5 years ago

Thanks for clarifying this @danijar. I got confused, as in the paper it seemed to be included in the 3rd equation, yet it wasn't there anymore in the final 7th equation. But maybe it's just me ¯\_(ツ)_/¯. Or does D = 1 in the 7th equation mean no overshooting?

danijar commented 5 years ago

Exactly, with D=1 latent overshooting reduces to the standard ELBO. It's just a bit confusing that the overshooting parameter in the code is not D but D-1, such that overshooting: 0 means the standard ELBO.

piojanu commented 5 years ago

What are future_rnn: True and free_bits?

astronautas commented 5 years ago

Thanks @danijar, it's clear now!

I haven't been able to find the free_bits parameter in the source code either @danijar @piojanu.

danijar commented 5 years ago

Sorry, it's called free_nats and means the model is allowed to use this amount of nats without a KL penalty, a trick that's often used for static VAEs. It helps the model focus on smaller details that don't contribute much to improving the reconstruction loss. The future_rnn flag fixes a somewhat subtle bug in the RSSM code, where the RNN and the stochastic state were both used but didn't interact with each other at future steps.
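A common way the free-nats trick is implemented (a sketch, not necessarily the exact PlaNet code):

    import tensorflow as tf

    def kl_with_free_nats(kl, free_nats=3.0, scale=1.0):
        """Only penalize the part of the KL above `free_nats`, so small
        divergences come "for free" and the posterior keeps some capacity."""
        return scale * tf.maximum(kl - free_nats, 0.0)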

astronautas commented 5 years ago

@danijar Thanks for the info, I'll try these!

I am also curious, what's the use of the observation model? I've read it's only used during training as a "signal", which intuitively sounds valid, as we want the latent states to contain visual information. Could we instead train only the reward model, so that the latent states would be more "focused" on encoding reward-relevant visual information?

TL;DR: What was the motivation behind including the observation model in training even though it's not used for planning? Wouldn't it be better to train one good reward model instead?

danijar commented 5 years ago

I've experimented with learning only reward models a while ago. This works in surprisingly many cases, however, it doesn't work well when there are sparse rewards, as in cup catch. Moreover, it's a bit less data efficient, but still much more data efficient than most model-free approaches. I think the effect of the observation model is comparable to the effect of auxiliary losses in UNREAL, etc. But I also want to move towards agents that learn without reward, which makes predicting sensory inputs more important.

Kaixhin commented 5 years ago

@danijar thanks for all your hard work open-sourcing this. I'm working on a PyTorch port and it was fundamental in working out the architecture (even though the paper was very clear about most other aspects). Just to confirm, your reported results were without fixing the future_rnn bug? I've managed to replicate your results (i.e., overshooting is important) having fixed that in my code on the walker-walk environment, so any comments on the importance of that fix vs. replacing all ReLUs with ELUs?

FYI, I don't think I spotted any other oddities in your code. Unfortunately I'm looking at almost 2 weeks per run at the moment and a not-insignificant amount of memory usage, so I will have to look into possible speedups, but the algorithm is quite sequential in general, so I'm not sure much can be done.

danijar commented 5 years ago

@Kaixhin That's great! The results were without the bug fix and I'll update them in the paper. There are a few other small updates to the paper that I want to make (add some citations, etc.), so it might take me a couple of days. I've confirmed that elu is not the deciding factor.

It'd be great to see some plots comparing with and without overshooting using your implementation, if you have them. Maybe you could open a new thread for everything related to your PyTorch rewrite? I also have some ideas to speed up your code that we can discuss there.

danijar commented 5 years ago

Hi @piojanu and @astronautas, a few updates from my side:

I think this could be worth another try for Breakout. I'll also sync all my changes to Github at some point for the camera-ready version of the ICML paper, but it could take a few weeks until I get to it.

piojanu commented 5 years ago

@danijar thanks for your update! I need to catch up with your posts. However, here are my results from a random search over the overshooting (from 0 to 50), divergence_scale (from 1E-5 to 1E-2), and global_divergence_scale (from 1E-5 to 1E-2) parameters. Every data point is from Freeway training for 1M steps, and the three scores are the last three logged test scores in TensorBoard: https://docs.google.com/spreadsheets/d/1VCQRyFVZzzbSE_LPpNIvu_QnzFSx9tV8g99mfVaK7gw/edit?usp=sharing

Well... to me it looks like random noise :/ It seems like really low overshooting makes things worse, but I have another experiment where, with no overshooting and the global divergence scale set to 0.01, it does much much better (score ~16) after ~2M steps of training, so that seems like random chance. My conclusion is that it doesn't matter much whether those scales are 1E-3 or 1E-5. What I can see clearly is that lowering those scales from the defaults indeed helps. Overshooting doesn't seem to have an obvious advantage either. @astronautas and @danijar could you take a look at this and tell me what you think?

EDIT: Quick question, do you use confidence intervals and hypothesis testing when analysing e.g. hyper-param tuning results? I plan to add those tomorrow.

astronautas commented 5 years ago

@danijar I'll give it a shot with the new configuration on VizDoom, as I've wanted to do this from the very beginning. Thanks for the fixes!

@piojanu Thanks for sharing the results! It looks like large overshooting coupled with a small global prior scale is needed, together with a somewhat low, but not too low, divergence scale. Enforcing long-range prediction consistency with a low (divergence) scale seems to have helped. Lowering the global divergence scale makes the model distributions less Gaussian, which seems to have helped as well.

piojanu commented 5 years ago

I've also run training with the future_rnn: true, divergence_scale: 0.0001, global_divergence_scale: 0.0001 parameters for Boxing and Freeway. I intend to leave it training for a longer time. We'll see what comes out of it ;)

@astronautas what do you mean by "as well as reducing the dependency on forcing models to be Gaussian-like"?

danijar commented 5 years ago

I don't think lowering the divergence scale (normal or global) makes anything more or less Gaussian. The posteriors are conditionally Gaussian already. Note that the trajectory distribution is highly non-Gaussian because of the non-linear transformation at each time step. So starting from a Gaussian, the next time step is already a mixture of infinitely many Gaussians. One reason that lowering the divergence scale can help is that it allows the model to absorb more information from its observations by loosening the information bottleneck.