hjsuh94 / score_po

Score-Guided Planning

Compare against Janner's approach #51

Open hongkai-dai opened 1 year ago

hongkai-dai commented 1 year ago

I think when we use the "direct collocation" formulation

min c(x₁, ..., xₙ, u₁, ..., uₙ) − β Σᵢ log p(xᵢ, uᵢ, xᵢ₊₁)

although this objective function looks similar to Janner's approach, in practice our approach is easier, for the following reason:
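To make this concrete, here is a minimal sketch of how the objective above could be minimized by gradient descent over the whole trajectory. The names `cost_fn` (trajectory cost) and `log_p` (learned transition log-likelihood model) are hypothetical placeholders for illustration, not existing code in this repo:

```python
import torch

# Hypothetical components: `cost_fn` maps a full trajectory to a scalar cost, and
# `log_p` is a learned model of the single-step transition likelihood
# log p(x_i, u_i, x_{i+1}). Names and shapes are assumptions for illustration.
def trajectory_objective(x, u, cost_fn, log_p, beta):
    """x: (N+1, nx) states, u: (N, nu) inputs."""
    cost = cost_fn(x, u)
    # Soft dynamics constraint: sum the transition log-likelihoods over the horizon.
    transitions = torch.cat([x[:-1], u, x[1:]], dim=-1)
    log_likelihood = log_p(transitions).sum()
    return cost - beta * log_likelihood

def plan(x0, cost_fn, log_p, beta=1.0, horizon=20, nx=4, nu=2, iters=500, lr=1e-2):
    # Decision variables: the entire state/input trajectory, as in direct collocation.
    x = torch.zeros(horizon + 1, nx, requires_grad=True)
    u = torch.zeros(horizon, nu, requires_grad=True)
    opt = torch.optim.Adam([x, u], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = trajectory_objective(x, u, cost_fn, log_p, beta)
        # Pin the initial state with a quadratic penalty (could also be a hard constraint).
        loss = loss + 1e3 * ((x[0] - x0) ** 2).sum()
        loss.backward()
        opt.step()
    return x.detach(), u.detach()
```

The point is that the planning cost enters the objective directly; no auxiliary model beyond the learned likelihood (or its score) is needed.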

In Janner's approach, they need to train a classifier to guide the diffusion process. Note that they cannot use the cost function exp(−c(x₁ᵗ, ..., xₙᵗ, u₁ᵗ, ..., uₙᵗ)) directly as this guidance classifier, but have to train a separate one. (Here the superscript t on x₁ᵗ denotes the t-th denoising step, not the t-th step in the planning horizon.) The reason is that during the denoising stage, the trajectory x₁ᵗ, ..., xₙᵗ, u₁ᵗ, ..., uₙᵗ contains a lot of noise, and what the classifier should predict is the probability of the *denoised* trajectory x₁⁰, ..., xₙ⁰, u₁⁰, ..., uₙ⁰ being optimal, not the optimality of the noisy trajectory.

So to train this guidance classifier, they start with a noise-free trajectory x₁⁰, ..., xₙ⁰, u₁⁰, ..., uₙ⁰, inject noise into it for multiple steps, pair the noisy trajectory x₁ᵗ, ..., xₙᵗ, u₁ᵗ, ..., uₙᵗ with the target probability exp(−c(x₁⁰, ..., xₙ⁰, u₁⁰, ..., uₙ⁰)), and train a classifier model by regression. This is extra effort to train the classifier, while we just use the cost function c(x₁, ..., xₙ, u₁, ..., uₙ) directly.
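For comparison, a rough sketch of the extra training step that Janner-style guidance requires might look like the following, where `forward_diffuse`, `cost_fn`, and `classifier` are assumed placeholders for the diffusion forward process, the planning cost, and the guidance model being trained:

```python
import torch

# Minimal sketch (under assumed names) of training the guidance classifier:
# regress from a *noisy* trajectory to the optimality probability of the *clean* one.
def guidance_classifier_loss(tau0_batch, classifier, forward_diffuse, cost_fn, num_steps):
    """tau0_batch: (B, N, nx + nu) clean trajectories from the dataset."""
    B = tau0_batch.shape[0]
    # Sample a denoising step t per trajectory and noise the clean trajectory.
    t = torch.randint(0, num_steps, (B,))
    tau_t = forward_diffuse(tau0_batch, t)
    # Target: (unnormalized) optimality probability of the clean trajectory.
    target = torch.exp(-cost_fn(tau0_batch))   # shape (B,)
    pred = classifier(tau_t, t)                # predicted from the noisy trajectory
    return torch.nn.functional.mse_loss(pred, target)
```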

This also raises the question of whether we should consider a classifier-free planning approach, such as https://arxiv.org/pdf/2211.15657.pdf?
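For reference, the classifier-free idea in that paper amounts to mixing a conditional and an unconditional denoiser prediction instead of training a separate classifier; a rough sketch, with an assumed `denoiser(tau_t, t, cond)` interface, is:

```python
import torch

# Classifier-free guidance sketch (assumed interface): `denoiser` predicts noise,
# optionally conditioned on e.g. a return or goal; `cond=None` gives the
# unconditional prediction, and `w` is the guidance weight.
def classifier_free_eps(denoiser, tau_t, t, cond, w=1.5):
    eps_uncond = denoiser(tau_t, t, cond=None)
    eps_cond = denoiser(tau_t, t, cond=cond)
    # Extrapolate toward the conditional prediction; no separate classifier needed.
    return eps_uncond + w * (eps_cond - eps_uncond)
```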