Learning Latent Dynamics for Planning from Pixels
Link
Official repo: google-research/planet
Author/Institution
Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, James Davidson
Google Brain, University of Toronto, DeepMind, Google Research, University of Michigan
What is this
Proposed the Deep Planning Network (PlaNet)
a purely model-based agent that learns the environment dynamics from images and chooses actions through fast online planning in latent space.
Proposed a multi-step variational inference objective that the authors name latent overshooting (the underlying training objective is sketched after this list).
Showed that the agent solves continuous control tasks with partial observability and sparse rewards using only pixel observations.
Achieved performance close to, and sometimes higher than, strong model-free algorithms.
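For context, this is roughly the standard one-step variational bound (ELBO) the model is trained on, where $q(s_t \mid o_{\leq t}, a_{<t})$ denotes the learned encoder (filtering posterior) and the transition and observation models are the ones listed further below; treat it as a sketch of the objective's structure rather than an exact transcription of the paper's equation:

$$
\ln p(o_{1:T} \mid a_{1:T}) \;\geq\; \sum_{t=1}^{T} \Big( \mathbb{E}_{q(s_t \mid o_{\leq t}, a_{<t})}\!\big[\ln p(o_t \mid s_t)\big] \;-\; \mathbb{E}_{q(s_{t-1} \mid o_{\leq t-1}, a_{<t-1})}\!\big[ \mathrm{KL}\big[\, q(s_t \mid o_{\leq t}, a_{<t}) \,\big\|\, p(s_t \mid s_{t-1}, a_{t-1}) \,\big] \big] \Big)
$$

Latent overshooting keeps the reconstruction term but adds KL regularizers against multi-step priors, i.e. states predicted by rolling the transition model forward $d$ steps, averaged over distances $d = 1, \dots, D$, so that the model is trained directly for accurate long-horizon predictions.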
Control: Model Predictive Control (MPC)
replans at each step (sounds computationally expensive)
Planning algorithm: the cross-entropy method (CEM) is used to search for the best action sequence under the model (a minimal planner sketch follows this list)
Why CEM?: "We decided on this algorithm because of its robustness and because it solved all considered tasks when given the true dynamics for planning"
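A minimal sketch of the MPC + CEM loop, assuming a hypothetical `rollout_reward(state, actions)` helper that sums the rewards predicted by the learned latent dynamics for one candidate action sequence; the horizon, population size, and number of elites below are illustrative, not the paper's settings:

```python
import numpy as np

def cem_plan(init_state, rollout_reward, action_dim,
             horizon=12, iterations=10, candidates=1000, top_k=100):
    """Search for a good action sequence with the cross-entropy method (CEM)."""
    mean = np.zeros((horizon, action_dim))   # search distribution: diagonal Gaussian
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample candidate action sequences from the current search distribution.
        actions = mean + std * np.random.randn(candidates, horizon, action_dim)
        actions = np.clip(actions, -1.0, 1.0)
        # Score each sequence by its predicted cumulative reward under the model.
        returns = np.array([rollout_reward(init_state, a) for a in actions])
        # Re-fit the search distribution to the elite (top-k) sequences.
        elite = actions[np.argsort(returns)[-top_k:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    # MPC: execute only the first action of the best plan, then replan next step.
    return mean[0]

# Toy usage with a dummy scorer standing in for latent-space reward prediction.
first_action = cem_plan(init_state=None,
                        rollout_reward=lambda s, a: -np.square(a).sum(),
                        action_dim=6)
```

In PlaNet the candidate rollouts are evaluated entirely in latent space using the transition and reward models (images are never decoded during planning), which is what keeps replanning at every step affordable.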
"Model" in this architecture refers three thigs:
transition model $p(s_t|s_{t-1}, a_{t-1})$
Gaussian with mean and variance parameterized by a feed-forward neural network
observation model $p(o_t|s_t)$
Gaussian with mean parameterized by a deconvolutional neural network and identity covariance
reward model $p(r_t|s_t)$
scalar Gaussian with mean parameterized by a feed-forward neural network and unit variance
and the policy $p(a_t|o_{\leq t}, a_{<t})$ aims to maximize the expected sum of rewards.
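A rough PyTorch sketch of these three components (the official implementation is in TensorFlow; the latent size, layer widths, and 64×64 image resolution here are illustrative assumptions, not the paper's exact hyperparameters):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributions as td

class TransitionModel(nn.Module):
    """p(s_t | s_{t-1}, a_{t-1}): Gaussian with mean and variance from a feed-forward net."""
    def __init__(self, state_dim=30, action_dim=6, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ELU(),
            nn.Linear(hidden, 2 * state_dim),  # predicts mean and pre-softplus std
        )

    def forward(self, prev_state, prev_action):
        mean, std = self.net(torch.cat([prev_state, prev_action], -1)).chunk(2, -1)
        return td.Normal(mean, F.softplus(std) + 1e-4)

class ObservationModel(nn.Module):
    """p(o_t | s_t): image Gaussian with mean from a deconvolutional net, identity covariance."""
    def __init__(self, state_dim=30):
        super().__init__()
        self.fc = nn.Linear(state_dim, 1024)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(1024, 128, 5, stride=2), nn.ELU(),
            nn.ConvTranspose2d(128, 64, 5, stride=2), nn.ELU(),
            nn.ConvTranspose2d(64, 32, 6, stride=2), nn.ELU(),
            nn.ConvTranspose2d(32, 3, 6, stride=2),  # -> 3 x 64 x 64 image mean
        )

    def forward(self, state):
        mean = self.deconv(self.fc(state).view(-1, 1024, 1, 1))
        return td.Normal(mean, 1.0)  # fixed unit variance per pixel

class RewardModel(nn.Module):
    """p(r_t | s_t): scalar Gaussian with mean from a feed-forward net and unit variance."""
    def __init__(self, state_dim=30, hidden=200):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ELU(),
                                 nn.Linear(hidden, 1))

    def forward(self, state):
        return td.Normal(self.net(state).squeeze(-1), 1.0)
```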
Comparison with previous research. What are the novelties/good points?
The robotics community focuses on video prediction models for planning (Agrawal et al., 2016; Finn & Levine, 2017; Ebert et al., 2018; Zhang et al., 2018) that deal with the visual complexity of the real world and solve tasks with a simple gripper, such as grasping or pushing objects.
In comparison, we focus on simulated environments, where we leverage latent planning to scale to larger state and action spaces, longer planning horizons, as well as sparse reward tasks.
E2C (Watter et al., 2015) and RCE (Banijamali et al., 2017) embed images into a latent space, where they learn local-linear latent transitions and plan for actions using LQR. These methods balance simulated cartpoles and control 2-link arms from images, but have been difficult to scale up.
We lift the Markov assumption of these models, making our method applicable under partial observability, and present results on more challenging environments that include longer planning horizons, contact dynamics, and sparse rewards.
Key points
Regarding the recurrent network used for planning, the authors claim the following:
our experiments show that both stochastic and deterministic paths in the transition model are crucial for successful planning
and the network architecture is the one shown in Figure 2 (c), called the recurrent state-space model (RSSM); a minimal sketch of the split transition follows.
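A minimal PyTorch sketch of that split transition, assuming a GRU cell for the deterministic path and a Gaussian over the stochastic state; sizes and activations are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributions as td

class RSSMTransition(nn.Module):
    """One step of the recurrent state-space model: a deterministic recurrent path
    h_t = f(h_{t-1}, s_{t-1}, a_{t-1}) plus a stochastic state s_t ~ p(s_t | h_t)."""
    def __init__(self, stoch_dim=30, deter_dim=200, action_dim=6, hidden=200):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(stoch_dim + action_dim, deter_dim), nn.ELU())
        self.gru = nn.GRUCell(deter_dim, deter_dim)   # deterministic path
        self.prior = nn.Sequential(nn.Linear(deter_dim, hidden), nn.ELU(),
                                   nn.Linear(hidden, 2 * stoch_dim))

    def forward(self, prev_stoch, prev_action, prev_deter):
        x = self.pre(torch.cat([prev_stoch, prev_action], -1))
        deter = self.gru(x, prev_deter)               # carries information deterministically
        mean, std = self.prior(deter).chunk(2, -1)    # stochastic path captures uncertainty
        stoch = td.Normal(mean, F.softplus(std) + 1e-4).rsample()
        return stoch, deter
```

During training an encoder conditions on the deterministic state and the current image to give the posterior over $s_t$; during planning only the prior transition above is rolled forward.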
How did the authors prove the effectiveness of the proposal?
Experiments in continuous control tasks:
Cartpole Swing Up, Reacher Easy, Cheetah Run, Finger Spin, Cup Catch, and Walker Walk from the DeepMind Control Suite
Confirmed that the proposed model achieved comparable performance to the best model-free algorithms while using 200× fewer episodes and similar or less computation time.
Any discussions?
What should I read next?
Broader contextual review: