google-research / planet

Learning Latent Dynamics for Planning from Pixels
https://danijar.com/planet
Apache License 2.0

Intuition about hyper parameters for Atari games #21

Closed piojanu closed 4 years ago

piojanu commented 5 years ago

Hi!

I decided to put Sokoban off for now as it doesn't seem to work the same way as in World Models. So I started to experiment with Atari once again. I easily made World Models work with e.g. the Boxing environment, but in PlaNet I get very blurry open-loop predictions: image Please, could you share your intuition about hyper-parameters I might try to tune? I tried to disable overshooting (to make the case more similar to World Models) and now I'm looking into divergence scales, but without luck so far. As I have no cluster to run hyper-parameter optimisation, maybe you have some ideas about what is worth focusing on? :D

Thanks!

astronautas commented 5 years ago

@danijar Yup you're right, it makes sense :).

One question - in Atari games, it is crucial that the agent does not learn that it's good to die in order to get a reward after the environment is reset. Are the episodes concatenated during training, or is the training done for each episode separately? Could there be training chunks containing <state, action, reward> tuples from different episodes?

danijar commented 5 years ago

All episodes are trained separately in the code, so as long as the episode terminates when the agent dies, this is not a problem.

piojanu commented 5 years ago

I'm back with good news :)

TL;DR

Atari Freeway and Crazy Climber seem to start working. I get sharp open-loop reconstructions, but still with some random "glimmer" of the chicken in the case of Freeway and of the building in the case of Crazy Climber. Boxing also starts to work (the posterior doesn't collapse any more)!

Details

Hyper-params (if not stated otherwise):

Boxing

Experiments: I've run those for ~4.3M steps, which is around 870 episodes of Boxing.

  1. action_repeat: 8 - this didn't help with the collapsing posterior, it only made the agent worse. The final score is below 0, which is worse than a random agent, which scores around 0. Before this higher-action-repeat experiment it was around 0 too, so it acted like a random agent.
  2. free_nats: 4 plus the action repeat from above - the higher free_nats helped a lot with the collapsing posterior! Final score as above.

Open-loop and reward predictions: image It still doesn't look great, but at least the boxers aren't turning into a blurry blob!

image Reward prediction doesn't seem to work. Any ideas on how to improve it, @danijar and @astronautas?

Next actions: I've run it again with standard action_repeat: 4 and free_nats: 4 for longer.

Freeway

Experiments: Hyper-params listed at the top. Here I've verified that future_rnn: true helps a lot. I've run it for ~4.2M steps or ~850 Freeway episodes.

Open-loop and reward predictions: image In the GIF animation it can be seen even better that the cars are precisely modelled and sharply reconstructed. However, the chicken is reconstructed somewhat randomly, jumping up and down. This needs to be improved.

image Well, rewards here are really sparse, so there is not much signal to learn from to predict them.

Next actions: I've run it with free_nats: 3 in the hope that the chicken will be modelled better. Also, there is the reward prediction problem to solve - any ideas, guys?

Crazy Climber

Experiments: Again, hyper-params from the top. I've run it for ~3M steps which is ~620 Crazy Climber episodes.

Open-loop and reward predictions: image Reconstructions are sharp, but the building structure isn't modelled well (e.g. the black hole in the middle is usually much higher/lower than reconstructed). It can be seen better in the GIF.

image Again, it didn't learn to predict rewards correctly.

Also, the agent has problems with moving up (some combination of actions is needed to raise a hand, then grab the wall and then pull up). A random agent does better. It seems like an exploration problem - what do you think?

Next actions: I've run it with the global prior disabled to check if it changes anything. @danijar found out that it isn't important with the fixed RSSM, see #28.

Conclusions

I suspect that rewards aren't predicted well and that's why it performs so poorly (in the case of Crazy Climber it does worse than random; in this paper they have nice benchmarks and comparisons, e.g. Table 1, and their algorithm seems great too). The other possibility, which I'll check, is poor exploration. We also need to remember that CEM isn't designed for a discrete action space, and maybe the hacky way of argmaxing sampled action scores isn't the best idea (maybe we should use something softer, e.g. sampling from a proportional/softmax distribution over those action scores?).
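As a purely illustrative sketch of that softer alternative (nothing here is code from this repo; the function name and the temperature parameter are made up), the executed action could be sampled from a softmax over the planner's action scores instead of taking the argmax:

```python
import numpy as np

# Illustrative only: sample the executed action from a softmax over the
# planner's action scores, so that nearby scores lead to similar behaviour.
def sample_action(action_scores, temperature=1.0, rng=np.random):
    logits = np.asarray(action_scores, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)  # index of the sampled discrete action
```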

I'm interested in what you think, @astronautas and @danijar. What next steps can you see?

EDIT: One more insight into the argmax policy, which I implement in a Gym environment wrapper: PlaNet takes the action scores as input to the transition model, not a discrete action. A small change in action scores, e.g. from [0.70, 0.69] to [0.70, 0.71], gives a completely different action under the argmax policy, whereas PlaNet sees little change in the scores. This can make it harder to model the agent's behaviour, which might result in the chicken's random jumps up and down (because PlaNet misrecognises which action was taken). A solution would be to modify the CEM algorithm to return a one-hot vector for the chosen action, not the action scores. PlaNet would then learn action embeddings.
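To make the idea concrete, here is a minimal sketch of the proposed change (a hypothetical helper, not the actual wrapper code): the planner's scores are snapped to a one-hot vector before being fed back to the model, so the model sees exactly the action that was executed:

```python
import numpy as np

# Hypothetical helper, not the actual wrapper code: replace the raw CEM
# action scores with a one-hot encoding of the argmax action.
def discretize_action(action_scores):
    one_hot = np.zeros_like(action_scores)
    one_hot[np.argmax(action_scores)] = 1.0  # the action argmax actually picks
    return one_hot

# [0.70, 0.69] and [0.70, 0.71] are nearly identical as scores, yet argmax
# picks different actions; as one-hot inputs the difference becomes explicit:
print(discretize_action(np.array([0.70, 0.69])))  # -> [1. 0.]
print(discretize_action(np.array([0.70, 0.71])))  # -> [0. 1.]
```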

piojanu commented 5 years ago

Quick updates:

  1. I've implemented this CEM modification for discrete action spaces (one-hot actions + an e-greedy exploratory policy in mpc_agent.py, instead of the additive noise used in the continuous action-space case; see the sketch after this list). I'm testing it now in Freeway and Crazy Climber.
  2. free_nats=3 helped with Freeway! Although there are still some errors in the predictions, the chicken's moves are smooth and stable (they don't break the env dynamics with e.g. teleportation). I'm truly amazed right now and I'm starting to believe it might work 😄 image
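For reference, a rough sketch of what the e-greedy exploration from point 1 could look like (illustrative names only, not the actual mpc_agent.py code):

```python
import numpy as np

# With probability epsilon, a uniformly random one-hot action replaces the
# planner's choice, instead of the additive Gaussian noise used for
# continuous control. Illustrative sketch, not the actual mpc_agent.py code.
def explore_discrete(one_hot_action, epsilon, rng=np.random):
    num_actions = one_hot_action.shape[-1]
    if rng.uniform() < epsilon:
        random_action = np.zeros(num_actions)
        random_action[rng.randint(num_actions)] = 1.0
        return random_action
    return one_hot_action
```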

I'll probably post more tomorrow.

astronautas commented 5 years ago

@piojanu Did you try running the same experiment with and without the discrete planner? Does it really improve the results?

piojanu commented 5 years ago

@astronautas sorry, I've edited my post. I'm testing it now ;)

piojanu commented 5 years ago

I have new results and I'm hoping for your insights, @astronautas and @danijar.

Freeway - discrete planner experiment

Well, the agent has a hard time getting to the other side of the road (where it gets a reward of 1). I've issued multiple runs with an epsilon of 0.3 and 0.7 in the e-greedy policy, and in three out of four runs it didn't collect any reward (in the data collection phase and in the test phase): image It's even worse than without the discrete planner; I'll try to find out why. As for transition modelling, the results aren't clear either: in 2 runs it models the actions really well, in the other 2 runs the chicken jumps unpredictably. So I'll now try disabling the global prior. Do you have any thoughts?

Boxing - higher free nats and lower divergence scale

Still fighting; I've currently tested divergence scales 1E-4 and 1E-5 with free nats 4 in multiple runs. The reconstructions still look like randomly jumping blobs, EVEN for the closed-loop prior reconstructions (one-step prediction) in the case of divergence scale 1E-5. So lowering the divergence scale further doesn't seem to make much sense. I'll try with higher free nats (5). It's worth noting that the closed-loop posteriors look fine, so I bet it's still a problem with the transition model. @danijar do you have any ideas? Here are the closed-loop predictions for 1E-4 and then 1E-5: image image

And here are the open-loop predictions for 1E-4 and then 1E-5: image image

Crazy Climber - higher free nats

Nothing new, the reconstructions are still not precise. I won't post anything more as it looks much the same as in the Boxing case. There are small differences between the frames and the transition model doesn't seem to capture them. @astronautas any ideas how we can work that out?

piojanu commented 5 years ago

I made Boxing work really well! :D The score is high and the open-loop predictions are "sharp".

image image

TL;DR ~7M steps; disabled global prior; enabled future rnn; used "discrete" CEM described above; divergence scale 3E-02; free nats 12; batch size [20, 50];

Hyper-params

The most important parameters seem to be free nats and the divergence scale (and turning on the future RNN). I did some tuning with many parameters and no other seems to have that big an impact (I didn't try to change the reward scales, though). The random search results are (with only 1M steps!): the lower the divergence scale, the noisier the image predictions are. The higher the free nats, the better the action and movement predictions in the images are (more stable, one could say). Although there were quite good reconstructions for low free nats too (when the divergence scale was a bit higher than or equal to 1E-02). I'll try that next.
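For anyone who wants to reproduce this, the working Boxing settings from the TL;DR written out as a params-style dictionary. The key names below follow the terminology used in this thread and may not match configs.py verbatim, so treat it as a sketch rather than a copy-paste override:

```python
# Sketch of the Boxing settings that worked (~7M steps); key names follow the
# thread's terminology and may differ from the actual config keys in the repo.
boxing_params = {
    'future_rnn': True,              # enabling the future RNN mattered a lot
    'global_divergence_scale': 0.0,  # global prior disabled (see #28)
    'divergence_scale': 3e-2,
    'free_nats': 12,
    'batch_shape': [20, 50],         # batch size [20, 50] from the TL;DR
    # plus the "discrete" one-hot CEM modification described earlier
}
```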

Questions

@danijar what might cause such a dramatic leap in the image loss/log_prob statistics around step 2M? See the diagrams: image image image image image image image image image image image

The reward open-loop log_prob also started to decrease and then increase over the same range: image

Other reward statistics don't show such a clear pattern: image image

Also, in this interval it was scoring less, and then it started to score more and more: image

So could it be that the reward model started to learn at the expense of the observation model and then the two found a balance again? I don't see much difference in the collected dataset around the 2M step (although I only looked at the dataset images in TensorBoard, and that's not much to go on).

Further work

Now I'm trying lower free nats in Boxing. Also, I'm working on Freeway and MsPacman. Stay tuned!

danijar commented 5 years ago

That's great news! It could be a numerical instability or just that the agent has discovered a new part of the environment that is more surprising. I would suggest looking at the video predictions right before and after this happens (to the degree TensorBoard allows). The reward scales are not very important, as I've found. Have you tried disabling overshooting yet?

Besides this, I would recommend a divergence scale that is as high as possible while still allowing for good performance. For example, when you set the divergence scale to zero, it could learn to become a deterministic autoencoder, which can reconstruct well but is less likely to generalize to states in latent space that the decoder hasn't seen during training.
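To spell out the trade-off being discussed, here is a minimal sketch (not the exact PlaNet loss code) of one common way free nats and the divergence scale enter the objective: the KL term is clipped from below so it produces no gradient once it is already under the free-nats threshold, and whatever remains is weighted by the divergence scale:

```python
# Minimal sketch, not the exact PlaNet loss code: clip the KL below the
# free-nats threshold, then weight the remainder by the divergence scale.
def kl_regularizer(kl_divergence, free_nats=3.0, divergence_scale=1.0):
    clipped = max(kl_divergence - free_nats, 0.0)  # no penalty below the threshold
    return divergence_scale * clipped

# With free_nats=12 and divergence_scale=3e-2 (the Boxing settings above),
# a KL of 10 nats contributes nothing, while a KL of 20 contributes 0.24.
```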

piojanu commented 5 years ago

@danijar thanks for your insights!

  1. I didn't try disabling overshooting yet (well, I did once at the very beginning of playing with PlaNet and it seemed to help, but that was before many other tweaks, so the experiment needs to be redone). I can't see how overshooting might make things worse - it seems like a great idea to calculate the loss over future predictions (that's what we really care about in the end: predicting the future well, not only one step ahead).
  2. I've looked into the image reconstructions before, during and after this "pit". Before it, the reconstructions were quite good; in the pit there is nothing but the green deck with barriers (?!); and after the pit the reconstructions are amazing. Strange 😮 The second run (I ran it more times to confirm the results) doesn't have this pit, but it has only just reached the point where the score should start to rise, so I don't know yet whether it will work.
  3. Yeah, I've even seen in my experiments that a low divergence scale yields really noisy future predictions. Now I'm running Boxing with 3E-02 and Freeway with 8E-03, the best parameters from the random search, but I'll also try a higher scale for Freeway (the same as in Boxing).

I'll get back with results of course :) I'm now more optimistic that it can work :D I also want to start experimenting with other planners, e.g. I want to train and incorporate a value function alongside the reward model. It should help in environments with sparse rewards (like cup catch and Freeway) where the planning horizon might be too short to see any reward at all. In the end, my goal is to use Silver's Dyna-2 from the TD-Search paper. Another benefit of using a latent state model (which was mentioned in World Models, but which I didn't see in the PlaNet paper) is that the model creates a great problem representation, which then makes the RL part (learning/planning the policy) much, much easier!
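As a rough sketch of how a learned value function could plug into the planner (hypothetical interfaces, nothing from the PlaNet codebase), the imagined return would be bootstrapped with the value of the last latent state, so that rewards beyond the planning horizon still influence the plan:

```python
# Hypothetical sketch: reward_model and value_model are assumed callables that
# map a latent state to a scalar; nothing here is from the PlaNet codebase.
def plan_return(latent_states, reward_model, value_model, discount=0.99):
    total, factor = 0.0, 1.0
    for state in latent_states[:-1]:
        total += factor * reward_model(state)  # predicted reward along the plan
        factor *= discount
    # The value of the final imagined state stands in for all rewards beyond it.
    total += factor * value_model(latent_states[-1])
    return total
```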