hjsuh94 opened this issue 1 year ago
A quick summary of my understanding of MOPO:
High-level idea: for the true reward (discounted return) on the actual environment T, define a lower bound of this true reward on the learned model T̂. This lower bound can be computed by maximizing the modified one-step reward r̃(s, a) = r(s, a) − λ u(s, a) on the learned model T̂, where u(s, a) is an uncertainty penalty that upper-bounds the model error.
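For concreteness, here is a minimal sketch of this penalized reward (the names `penalized_reward`, `reward_fn`, `ensemble`, and the disagreement-based choice of u(s, a) are assumptions for illustration, not MOPO's actual implementation):

```python
import numpy as np

def penalized_reward(s, a, reward_fn, ensemble, lam=1.0):
    """MOPO-style penalized one-step reward r~(s, a) = r(s, a) - lam * u(s, a).

    Here u(s, a) is taken as the maximum per-dimension std of next-state
    predictions across an ensemble of learned dynamics models -- a common
    heuristic stand-in for the model-error bound (an assumption for this
    sketch, not MOPO's exact estimator).
    """
    preds = np.stack([model(s, a) for model in ensemble])  # (n_models, state_dim)
    u = preds.std(axis=0).max()                            # ensemble disagreement
    return reward_fn(s, a) - lam * u

# Toy usage: two "models" that disagree slightly about the next state.
ensemble = [lambda s, a: s + a, lambda s, a: s + 1.05 * a]
r_tilde = penalized_reward(np.ones(3), np.ones(3), lambda s, a: 1.0, ensemble, lam=1.0)
```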
Pros of MOPO:

1. The lower bound becomes tighter as T̂ becomes more accurate, i.e., it is larger when the learned model T̂ is accurate.
2. The agent pays a price to exploit a state/action tuple where the transition model is less accurate, so as to balance between exploration and exploitation.

Cons of MOPO: the penalty u(s, a) has to be computed through a (loose) bound on the model error, which in practice is estimated with an ensemble of learned dynamics models.
As a comparison, our proposed regularization term −λ log p(s, a) shares the second advantage (it penalizes the total cost when an unseen state/action pair is exploited), but it does not require computing the regularization term through a loose bound, and the gradient of this regularization term is easy to approximate through diffusion, since ∇ log p(s, a) is exactly the score a diffusion model learns.
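A minimal sketch of this regularizer is below (the `log_p_fn` and `score_fn` callables are hypothetical placeholders for the learned density and its score, and the sign convention assumes the term is added to a cost that is being minimized):

```python
import numpy as np

def density_regularizer(s, a, log_p_fn, score_fn, lam=1.0):
    """Regularization term -lam * log p(s, a) and its gradient w.r.t. (s, a).

    `log_p_fn(s, a)` estimates log p(s, a) under the data distribution and
    `score_fn(s, a)` estimates grad_{(s,a)} log p(s, a) -- the quantity a
    diffusion model's score network is trained to approximate. No ensemble
    or explicit error bound is needed.
    """
    value = -lam * log_p_fn(s, a)
    grad_s, grad_a = score_fn(s, a)
    return value, (-lam * grad_s, -lam * grad_a)

# Toy usage with a standard Gaussian standing in for p(s, a):
# log p = -0.5 * ||(s, a)||^2 + const, so the score is simply -(s, a).
log_p = lambda s, a: -0.5 * (np.sum(s**2) + np.sum(a**2))
score = lambda s, a: (-s, -a)
val, (g_s, g_a) = density_regularizer(np.ones(2), np.ones(2), log_p, score, lam=0.1)
```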
Eventually we will compare our approach against MOPO; the comparison metric is the true objective ∑ₜ γᵗ r(sₜ, aₜ) on the real environment, with both MOPO and our system trained on the same data.
However, MOPO uses an ensemble of models to estimate the error, and I don't immediately see why we have to use an ensemble. If we compare our approach with a single dynamics model against MOPO with an ensemble of models, is that still a fair comparison?
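Either way, the evaluation itself is straightforward: roll out each learned policy on the real environment and compare discounted returns. A minimal sketch, assuming a simplified Gym-like interface (`env.reset()`, `env.step(a)` returning `(s, r, done)`) that is not tied to either method:

```python
def true_discounted_return(env, policy, gamma=0.99, horizon=1000):
    """Roll out `policy` on the real environment and return sum_t gamma^t * r_t,
    the quantity both MOPO and our approach are ultimately judged on.
    Assumes a simplified env interface: reset() -> s, step(a) -> (s, r, done)."""
    s = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        s, r, done = env.step(a)
        total += discount * r
        discount *= gamma
        if done:
            break
    return total
```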
- Why Model-Based?
- What about other generative models?
- What about other planners / policy optimizers that use diffusion?
- What if we don't include distribution risk / uncertainty?
- What about other approaches that tackle similar distribution risk problems?