"We propose a new algorithm called probabilistic ensembles with trajectory sampling (PETS) that combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation."
"Model-based reinforcement learning (RL) algorithms can attain excellent sample efficiency, but often lag behind the best model-free algorithms in terms of asymptotic performance. "
"Our comparison to state-of-the-art model-based and model-free deep RL algorithms shows that our approach matches the asymptotic performance of model-free algorithms on several challenging benchmark tasks, while requiring significantly fewer samples (e.g., 8 and 125 times fewer samples than Soft Actor Critic and Proximal Policy Optimization respectively on the half-cheetah task)."
Comparison with previous research. What are the novelties/good points?
"While a number of prior works have explored uncertainty-aware deep neural network models [Neal, 1995, Lakshminarayanan et al., 2017], including in the context of RL [Gal et al., 2016, Depeweg et al., 2016], our work is, to our knowledge, the first to bring these components together in a deep MBRL framework that reaches the asymptotic performance of state-of-the-art model-free RL methods on benchmark control tasks."
"these components" == ensembling and outputting Gaussian distribution parameters
Key points
Two types of uncertainty:
Aleatoric uncertainty
Arises from inherent stochasticities of a system (e.g. observation noise and process noise)
Captured by having the network output the parameters of a predictive distribution (here, the mean and variance of a Gaussian)
Epistemic uncertainty
Corresponds to subjective uncertainty about the dynamics function, due to a lack of sufficient data to uniquely determine the underlying system exactly
In the limit of infinite data, epistemic uncertainty should vanish
Captured by ensembling: the disagreement between bootstrap ensemble members reflects epistemic uncertainty (see the sketch after this list)
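A small sketch of how the two kinds of uncertainty can be read off such an ensemble (an illustrative decomposition under the interface assumed in the sketch above, not a formula from the paper): each member's predicted variance reflects aleatoric noise, while the spread of the members' means reflects epistemic uncertainty and shrinks as more data is collected.

```python
import torch

@torch.no_grad()
def split_uncertainty(ensemble, state, action):
    """Illustrative decomposition of the ensemble's predictive variance,
    assuming the (state, action) -> (mean, log_var) interface sketched above."""
    means, variances = [], []
    for model in ensemble:
        mean, log_var = model(state, action)
        means.append(mean)
        variances.append(log_var.exp())
    means = torch.stack(means)          # shape (B, state_dim)
    variances = torch.stack(variances)  # shape (B, state_dim)

    aleatoric = variances.mean(dim=0)             # average predicted noise
    epistemic = means.var(dim=0, unbiased=False)  # disagreement between members
    return aleatoric, epistemic
```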
How did the authors prove the effectiveness of the proposal?
Comparison to state-of-the-art model-based and model-free deep RL algorithms
Showed that "our approach matches the asymptotic performance of model-free algorithms on several challenging benchmark tasks, while requiring significantly fewer samples (e.g., 8 and 125 times fewer samples than Soft Actor Critic and Proximal Policy Optimization respectively on the half-cheetah task)."
Summary
Link
Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models Official Code
Author/Institution
Kurtland Chua, Roberto Calandra, Rowan McAllister, Sergey Levine (UC Berkeley)