hjsuh94 opened this issue 1 year ago
A quick summary of my understanding of MOPO:
High-level idea: for the true reward (discounted return) on the actual environment T, define a lower bound of this true reward on the learned model T̂. This lower bound can be computed by maximizing the modified one-step reward r̃(s, a) = r(s, a) − λ u(s, a) on the learned model T̂, where u(s, a) is an uncertainty penalty that upper-bounds the model error.
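For concreteness, here is a minimal sketch of this penalized reward (the names `penalized_reward`, `reward_fn`, `ensemble`, and the disagreement-based choice of u(s, a) are assumptions for illustration, not MOPO's actual implementation):

```python
import numpy as np

def penalized_reward(s, a, reward_fn, ensemble, lam=1.0):
    """MOPO-style penalized one-step reward r~(s, a) = r(s, a) - lam * u(s, a).

    Here u(s, a) is taken as the maximum per-dimension std of next-state
    predictions across an ensemble of learned dynamics models -- a common
    heuristic stand-in for the model-error bound (an assumption for this
    sketch, not MOPO's exact estimator).
    """
    preds = np.stack([model(s, a) for model in ensemble])  # (n_models, state_dim)
    u = preds.std(axis=0).max()                            # ensemble disagreement
    return reward_fn(s, a) - lam * u

# Toy usage: two "models" that disagree slightly about the next state.
ensemble = [lambda s, a: s + a, lambda s, a: s + 1.05 * a]
r_tilde = penalized_reward(np.ones(3), np.ones(3), lambda s, a: 1.0, ensemble, lam=1.0)
```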
Pros of MOPO:

1. The lower bound becomes tighter as T̂ becomes more accurate, i.e., it is larger when the learned model T̂ is accurate.
2. The agent pays a price to exploit a state/action tuple where the transition model is less accurate, so as to balance between exploration and exploitation.

Cons of MOPO: the penalty u(s, a) has to be computed through a (loose) bound on the model error, which in practice is estimated with an ensemble of learned dynamics models.
As a comparison, our proposed regularization term −λ log p(s, a) shares the second advantage (it penalizes the total cost when an unseen state/action pair is exploited), but it does not require computing the regularization term through a loose bound, and the gradient of this regularization term is easy to approximate through diffusion, since ∇ log p(s, a) is exactly the score a diffusion model learns.
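A minimal sketch of this regularizer is below (the `log_p_fn` and `score_fn` callables are hypothetical placeholders for the learned density and its score, and the sign convention assumes the term is added to a cost that is being minimized):

```python
import numpy as np

def density_regularizer(s, a, log_p_fn, score_fn, lam=1.0):
    """Regularization term -lam * log p(s, a) and its gradient w.r.t. (s, a).

    `log_p_fn(s, a)` estimates log p(s, a) under the data distribution and
    `score_fn(s, a)` estimates grad_{(s,a)} log p(s, a) -- the quantity a
    diffusion model's score network is trained to approximate. No ensemble
    or explicit error bound is needed.
    """
    value = -lam * log_p_fn(s, a)
    grad_s, grad_a = score_fn(s, a)
    return value, (-lam * grad_s, -lam * grad_a)

# Toy usage with a standard Gaussian standing in for p(s, a):
# log p = -0.5 * ||(s, a)||^2 + const, so the score is simply -(s, a).
log_p = lambda s, a: -0.5 * (np.sum(s**2) + np.sum(a**2))
score = lambda s, a: (-s, -a)
val, (g_s, g_a) = density_regularizer(np.ones(2), np.ones(2), log_p, score, lam=0.1)
```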
Eventually we will compare our approach against MOPO; the comparison metric is the true objective ∑ₜ γᵗ r(sₜ, aₜ) on the real environment, with both MOPO and our system trained on the same data.
However, MOPO uses an ensemble of models to estimate the error, and I don't immediately see why we have to use an ensemble. If we compare our approach with a single dynamics model against MOPO with an ensemble of models, is that still a fair comparison?
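Either way, the evaluation itself is straightforward: roll out each learned policy on the real environment and compare discounted returns. A minimal sketch, assuming a simplified Gym-like interface (`env.reset()`, `env.step(a)` returning `(s, r, done)`) that is not tied to either method:

```python
def true_discounted_return(env, policy, gamma=0.99, horizon=1000):
    """Roll out `policy` on the real environment and return sum_t gamma^t * r_t,
    the quantity both MOPO and our approach are ultimately judged on.
    Assumes a simplified env interface: reset() -> s, step(a) -> (s, r, done)."""
    s = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        s, r, done = env.step(a)
        total += discount * r
        discount *= gamma
        if done:
            break
    return total
```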
- Why Model-Based?
- What about other generative models?
- What about other planners / policy optimizers that use diffusion?
- What if we don't include distribution risk / uncertainty?
- What about other approaches that tackle similar distribution risk problems?