Closed: piojanu closed this issue 5 years ago.
Hi!

@danijar, could you expand on why propagation through Bellman backups is a problem? I can't understand why it makes RL sample-inefficient. The excerpt from the paper's introduction that touches on it:

Thanks for your time!

If you see a reward, it takes many iterations of training for a Q-function to assign a higher value to all the preceding states. If you learn a reward function instead, it can directly learn the high reward for that one state, and the planner can immediately find it.

Oh, that's obvious now, thanks!
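For intuition, here is a minimal, self-contained sketch of the effect described above. It is not from the paper or this repository: the chain MDP, the always-move-right policy, and the brute-force planner are illustrative assumptions, and the "learned" model is simply the true dynamics and reward to keep the example short.

```python
# Toy comparison: reward propagation through one-step Bellman backups (Q-learning)
# versus planning against a reward model. All names and numbers are illustrative;
# the "model" queried by the planner is just the true environment.
import itertools

import numpy as np

N = 10            # chain of states 0..N; the only reward is for reaching state N
GAMMA = 0.99
ACTIONS = (0, 1)  # 0 = stay, 1 = move right


def step(state, action):
    """Deterministic chain dynamics with a single reward at the end."""
    next_state = min(state + action, N)
    reward = 1.0 if (state != N and next_state == N) else 0.0
    done = next_state == N
    return next_state, reward, done


# --- Model-free: one-step Q-learning -----------------------------------------
Q = np.zeros((N + 1, len(ACTIONS)))
for episode in range(1, 1000):
    state, done = 0, False
    while not done:
        action = 1  # always move right, so only propagation speed matters
        next_state, reward, done = step(state, action)
        target = reward + (0.0 if done else GAMMA * Q[next_state].max())
        Q[state, action] = target  # learning rate 1: best case for propagation
        state = next_state
    if Q[0, 1] > 0:  # the start state finally "sees" the distant reward
        print(f"Q-learning needed {episode} episodes (chain length N = {N})")
        break


# --- Model-based: plan with a reward (and dynamics) model --------------------
def imagined_return(state, actions):
    """Return of an imagined action sequence under the model."""
    total, discount = 0.0, 1.0
    for action in actions:
        state, reward, done = step(state, action)
        total += discount * reward
        discount *= GAMMA
        if done:
            break
    return total


# Brute-force search over action sequences: the reward model assigns high
# reward to the goal state directly, so the very first planning call finds it.
best = max(itertools.product(ACTIONS, repeat=N),
           key=lambda seq: imagined_return(0, seq))
print("Planner reaches the reward on its first call:", imagined_return(0, best) > 0)
```

With one-step backups, the value of the final reward creeps back exactly one state per episode, so the first print shows about N episodes even in this best case; the planner, in contrast, only needs the reward model to assign high reward to the goal state once, and a single search over imagined rollouts already finds it.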