Hi, the main takeaway from Appendix A is that, at a particular t, the contribution of the neural network toward μt is not the same in the two designs. Recall that at each t, we compute μt as a weighted sum of the model prediction and the current xt. For an x0 model, the network's contribution is large near t = 0; for an ε model, it is large near t = T.
This figure might make it easier to understand.
What you see is how μt behaves under classifier guidance depending on whether we use x0 or ε.
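As a side note (my own illustration, not from the paper), the two coefficient curves are easy to reproduce numerically; the linear DDPM beta schedule here is an assumption, not necessarily the paper's:

```python
import numpy as np

# Coefficient of the network prediction in mu_t for each design,
# under an assumed linear DDPM beta schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
alpha_bar_prev = np.concatenate(([1.0], alpha_bar[:-1]))

# x0 design: mu_t = c_x0 * x0_pred + c_xt * x_t  (DDPM posterior mean)
c_x0 = np.sqrt(alpha_bar_prev) * betas / (1.0 - alpha_bar)
# eps design: mu_t = (x_t - c_eps * eps_pred) / sqrt(alpha_t)
c_eps = betas / np.sqrt(1.0 - alpha_bar)

print(c_x0[0], c_eps[0])    # near t=0: c_x0 ~ 1.0,  c_eps ~ 0.01
print(c_x0[-1], c_eps[-1])  # near t=T: c_x0 ~ 1e-4, c_eps ~ 0.02
```

In other words, the ε model's leverage over the sample shrinks toward t = 0, while the x0 model keeps (in fact maximizes) its leverage there.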
The key idea is that if the network's contribution is very large at the end (t close to 0), it can undo all the guidance applied up to that point. The underlying assumption is that, because the network has never seen a guided xt before, it will try to revert the change to bring the sample back in distribution, thus resisting the guidance.
If we only look at the output without classifier guidance, the result will be roughly the same for both designs.
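To make the mechanics concrete, here is a minimal sketch of one guided reverse step in the usual classifier-guidance style (shifted mean μ − s·σ²·∇G for a guidance cost G); the function and names are my own, not the repo's actual API:

```python
import torch

def guided_step(x_t, model_mean, sigma_t, G, s=1.0):
    # model_mean is mu_t computed from the network prediction (either design);
    # G is a guidance cost we want the sample to minimize.
    with torch.enable_grad():
        x = x_t.detach().requires_grad_(True)
        grad = torch.autograd.grad(G(x).sum(), x)[0]  # gradient of the cost at x_t
    mu_hat = model_mean - s * sigma_t**2 * grad       # push the mean downhill on G
    return mu_hat + sigma_t * torch.randn_like(x_t)   # sample x_{t-1} (skip noise at t=0)
```

Whether the shift survives depends on whether later steps, where the network dominates μt, pull the sample back toward the unguided distribution.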
@korrawe I agree that the contributions of the neural networks toward μt are not the same. However, what confuses me is that, from my perspective, the final μt for both designs is theoretically equal. If that is the case, it should not matter much which network contributes more. Additionally, why does a large contribution from x0 resist the guidance? Does a large contribution of x0 equate to a large μt and weak guidance strength?
Of course, in the generated motion-trajectory figure the two look unequal, indicating that the final μt for each design is indeed different. This is similar to how DDPM uses ε prediction instead of x0 prediction even though the two are theoretically equivalent.
Thank you for addressing my questions. I look forward to your assistance!
Can you elaborate a bit more on why you think μt is equal in both cases?
What is important is when each network contributes more. By definition, classifier guidance will push the output out of distribution in the eyes of the network, which will try to revert it.
If the influence of the network is high near the end, it can revert the guidance applied so far.
@korrawe I mean that the μ and σ of q(xt−1 | xt, x0) are equal to those of q(xt−1 | xt, ε). Therefore, the classifier-guided mean μ − s·σ·∇ should be the same in both designs. I am not sure if what I said is correct; could you please correct me if I'm wrong?
The terms q(xt−1 | xt, x0) and q(xt−1 | xt, ε) are equivalent, but that is not what the network predicts. In practice, we are not predicting xt−1; we predict the change from xt to xt−1.
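(For reference, the equivalence in question is the standard DDPM substitution; the notation below follows Ho et al. and is my addition, so it may differ slightly from the paper's appendix.)

```latex
% Posterior mean written with the x0 prediction:
\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0
                + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t
% Substituting x_0 = (x_t - \sqrt{1-\bar\alpha_t}\,\epsilon)/\sqrt{\bar\alpha_t}
% gives the identical mean written with the eps prediction:
\mu_t(x_t, \epsilon) = \frac{1}{\sqrt{\alpha_t}}\Bigl(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon\Bigr)
```

The means coincide as functions; what differs is which term the network supplies, and therefore how much weight its prediction carries at each t.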
If you look at the figure above: in both designs, μ is a sum of the current network prediction and the current result xt. The coefficients of these terms depend on the step number, and they are not equal at the same t in the two designs.
Maybe the easiest way to explain this is to ask, "How much does the network prediction at t change the current output xt?" The curve shows how much it can change xt at any given t.
@korrawe Thank you very much for your patient assistance. I understand the "contribution of the network prediction to the current output". Could I then interpret it as: μ, which is a sum of the current network prediction and the current result xt, is not equal at the same t in the two designs?
What we are interested in is not whether μ is equal in both designs, right? The two are not comparable because the networks predict different things, but that is another topic for discussion.
What we are interested in is the "ratio of contribution of the network prediction to the current output", i.e., how much the model can change the current result.
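That ratio is easy to sketch numerically (again my own illustration, under the same assumed linear schedule as the earlier snippet): write both designs as μt = a_t·xt + b_t·(network prediction) and take the network's share |b_t| / (|a_t| + |b_t|) of the update.

```python
import numpy as np

# Network's share of the mu_t update, mu_t = a_t * x_t + b_t * prediction,
# under an assumed linear DDPM beta schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
alpha_bar_prev = np.concatenate(([1.0], alpha_bar[:-1]))

# x0 design
a_x0 = np.sqrt(alphas) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar)
b_x0 = np.sqrt(alpha_bar_prev) * betas / (1.0 - alpha_bar)
# eps design
a_eps = 1.0 / np.sqrt(alphas)
b_eps = betas / (np.sqrt(alphas) * np.sqrt(1.0 - alpha_bar))

share_x0 = np.abs(b_x0) / (np.abs(a_x0) + np.abs(b_x0))
share_eps = np.abs(b_eps) / (np.abs(a_eps) + np.abs(b_eps))
print(share_x0[0], share_eps[0])    # near t=0: x0 share ~ 1.0,  eps share ~ 0.01
print(share_x0[-1], share_eps[-1])  # near t=T: x0 share ~ 1e-4, eps share ~ 0.02
```

Near t = 0 the x0 model still owns essentially the whole update, while near t = T the pattern flips and the x0 model's share vanishes.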
@korrawe I think I understand your point. I wanted to compare the two designs based on μt and assumed that both should be the same. However, it turns out they are not directly comparable. Therefore, we should focus on the "contribution" and on the experimental results for both designs, rather than theoretically comparing μ between them.
In Appendix A of this paper, the analysis compares x0 models and ε models in diffusion probabilistic models (DPMs), suggesting that an ε model is restricted to making smaller changes over time, while an x0 model can still make a large change even at the very end of the diffusion process. This is confusing to me. Are μt and σt different between the ε model and the x0 model? Aren't they the same over time? How does this affect classifier guidance?