google-research / planet

Learning Latent Dynamics for Planning from Pixels
https://danijar.com/planet
Apache License 2.0
1.18k stars 202 forks

Only last mean score (return) is taken when one simulate session contains multiple episodes #20

Closed · astronautas closed 5 years ago

astronautas commented 5 years ago

Hey @danijar,

It seems that the mean score (return) of a simulate session is taken only from the last episode when one session contains multiple episodes. The score is reset to zero at the beginning of every episode, right?

Could you clarify whether this is the case, or whether the return is the mean over every step's reward from 0 to max_steps? I was confused because some simulation sessions yield a reward of 0, which is highly unlikely to be a mean over e.g. 1k steps in Atari Assault-v0.

Thank you!

piojanu commented 5 years ago

Hi!

I'm currently working with PlaNet and maybe I can answer that question, but I don't quite understand what you mean. Could you clarify with an example?

astronautas commented 5 years ago

Thanks! I want to understand how the points in the train return graph (the first graph at the top left in TensorBoard) are calculated.

I am running PlaNet on Assault-v0 (Atari). I noticed that multiple episodes get saved during one data collection phase (as my agent dies multiple times). A data collection phase is a session run every config.collect_every steps, from 0 to config.task.max_steps, to collect data for training.

I want to be sure that the points in the return graph represent the mean score from 0 to config.task.max_steps rather than the score of the last episode of the current data collection session.

danijar commented 5 years ago

The tasks I'm using have a fixed episode length. What is supposed to happen for variable-length episodes is that the reported score is the mean return (not reward) of all completed episodes during this phase, i.e. the mean score over all episodes except the last one if it didn't finish. The rewards of that unfinished last episode are just discarded. @piojanu, if you have tried running on variable episode lengths, I'd be interested to hear if you can confirm.
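For intuition, here is a toy sketch of that bookkeeping (not PlaNet's actual code; the function and variable names are just illustrative):

```python
# Toy sketch of the bookkeeping described above (not PlaNet's actual code).
# Per-step rewards are summed into a running episode return; only episodes
# that finish within the phase contribute, and the partial tail is dropped.
def phase_mean_return(rewards, dones):
    completed_returns = []
    episode_return = 0.0
    for reward, done in zip(rewards, dones):
        episode_return += reward
        if done:
            completed_returns.append(episode_return)
            episode_return = 0.0  # score resets at the start of the next episode
    # Rewards of a trailing unfinished episode are simply discarded.
    if not completed_returns:
        return 0.0
    return sum(completed_returns) / len(completed_returns)
```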

piojanu commented 5 years ago

Hmm, interesting. I thought that during the simulation phase it collects one episode no matter how long it is. I need to dive into it. What implications does this have on training? Is the incomplete episode just discarded as a whole (images too)?

astronautas commented 5 years ago

@danijar Thanks for the clarification! Is the behavior the same (mean return over finished episodes, remainder dropped) for both data collection and testing?

piojanu commented 5 years ago

@astronautas It seems the answer is yes. For data collection, you can see that only full episodes are saved here (it answers my question too, see lines 462-467): https://github.com/google-research/planet/blob/9cd9abed5b9a8831388f4d9da16e5604cfbd7c20/planet/control/wrappers.py#L455-L468
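Roughly, the wrapper behaves something like this (an illustrative sketch, not the code in wrappers.py; the class and callback names are made up):

```python
# Illustrative sketch of the "save only full episodes" behavior
# (not the actual wrappers.py code; names are made up).
class SaveFullEpisodes:

    def __init__(self, env, save_episode):
        self._env = env
        self._save_episode = save_episode  # callback that writes an episode to disk
        self._transitions = []

    def reset(self):
        self._transitions = []
        return self._env.reset()

    def step(self, action):
        obs, reward, done, info = self._env.step(action)
        self._transitions.append((obs, action, reward, done))
        if done:
            # Only a completed episode is ever handed to the writer.
            self._save_episode(self._transitions)
        return obs, reward, done, info
```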

And for both the train phase and the test phase, the same model (and in turn summaries) definition is used (lines 160, 165, 172): https://github.com/google-research/planet/blob/9cd9abed5b9a8831388f4d9da16e5604cfbd7c20/planet/training/utility.py#L160-L176

So the same collection logic is used inside, which takes the scores of finished episodes (line 84): https://github.com/google-research/planet/blob/9cd9abed5b9a8831388f4d9da16e5604cfbd7c20/planet/control/simulate.py#L59-L88 Note that score keeps the sum of past rewards since the beginning of the episode (I think I correctly found the reset logic after each episode that guarantees this, in the same file at lines 225-228), and done holds True for terminating steps.

This way you retrieve only full-episode returns and then average them (reduce_mean) here (line 45): https://github.com/google-research/planet/blob/9cd9abed5b9a8831388f4d9da16e5604cfbd7c20/planet/control/simulate.py#L33-L56
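Put together, the reduction amounts to something like this (a hypothetical TensorFlow snippet, not the actual simulate.py graph; the values are made up):

```python
import tensorflow as tf

# Hypothetical illustration of the reduction (not the simulate.py graph):
# keep the accumulated score only at steps where done is True, then average.
score = tf.constant([10., 25., 40., 5., 12.])          # running return at each step
done = tf.constant([False, False, True, False, True])  # True at episode ends
finished_returns = tf.boolean_mask(score, done)        # returns of finished episodes: [40., 12.]
mean_return = tf.reduce_mean(finished_returns)         # value reported in the return graph: 26.0
```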

danijar commented 5 years ago

Yes, it's the same for training and testing. Does that answer your question?