Closed: piojanu closed this issue 5 years ago.
Hi!

@danijar, could you expand on why propagation through Bellman backups is a problem? I can't understand why it makes RL sample-inefficient. The excerpt from the paper's introduction that touches on it:

Thanks for your time!

If you see a reward, it takes many iterations of training for a Q-function to assign a higher value to all the preceding states. If you learn a reward function instead, it can directly learn the high reward for that one state, and the planner can immediately find it.

Oh, that's obvious now, thanks!
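For intuition, here is a minimal, self-contained sketch of the effect described above. It is not from the paper or this repository: the chain MDP, the always-move-right policy, and the brute-force planner are illustrative assumptions, and the "learned" model is simply the true dynamics and reward to keep the example short.

```python
# Toy comparison: reward propagation through one-step Bellman backups (Q-learning)
# versus planning against a reward model. All names and numbers are illustrative;
# the "model" queried by the planner is just the true environment.
import itertools

import numpy as np

N = 10            # chain of states 0..N; the only reward is for reaching state N
GAMMA = 0.99
ACTIONS = (0, 1)  # 0 = stay, 1 = move right


def step(state, action):
    """Deterministic chain dynamics with a single reward at the end."""
    next_state = min(state + action, N)
    reward = 1.0 if (state != N and next_state == N) else 0.0
    done = next_state == N
    return next_state, reward, done


# --- Model-free: one-step Q-learning -----------------------------------------
Q = np.zeros((N + 1, len(ACTIONS)))
for episode in range(1, 1000):
    state, done = 0, False
    while not done:
        action = 1  # always move right, so only propagation speed matters
        next_state, reward, done = step(state, action)
        target = reward + (0.0 if done else GAMMA * Q[next_state].max())
        Q[state, action] = target  # learning rate 1: best case for propagation
        state = next_state
    if Q[0, 1] > 0:  # the start state finally "sees" the distant reward
        print(f"Q-learning needed {episode} episodes (chain length N = {N})")
        break


# --- Model-based: plan with a reward (and dynamics) model --------------------
def imagined_return(state, actions):
    """Return of an imagined action sequence under the model."""
    total, discount = 0.0, 1.0
    for action in actions:
        state, reward, done = step(state, action)
        total += discount * reward
        discount *= GAMMA
        if done:
            break
    return total


# Brute-force search over action sequences: the reward model assigns high
# reward to the goal state directly, so the very first planning call finds it.
best = max(itertools.product(ACTIONS, repeat=N),
           key=lambda seq: imagined_return(0, seq))
print("Planner reaches the reward on its first call:", imagined_return(0, best) > 0)
```

With one-step backups, the value of the final reward creeps back exactly one state per episode, so the first print shows about N episodes even in this best case; the planner, in contrast, only needs the reward model to assign high reward to the goal state once, and a single search over imagined rollouts already finds it.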