hutaiHang / Faster-Diffusion

[NeurIPS 2024] Official implementation of "Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models"
https://arxiv.org/abs/2312.09608
Apache License 2.0

About Parallel encoder #4

Open sonwe1e opened 10 months ago

sonwe1e commented 10 months ago

Great work on the study, but I have some queries I'd like to ask.

If the time-steps considered non-key skip the encoder entirely, how are images decoded at these non-key time-steps from the features produced by the key time-step encoder? Since the encoder is skipped at non-key time-steps, there is no encoding at time t+1 either. Why not skip the non-key steps altogether?

sonwe1e commented 10 months ago

My point is: if time t is a key timestep and t+1, t+2, t+3 are non-key, then the decoders at t+1, t+2, t+3 all use the features f_t from time t. According to the parallel steps in the paper, t+1, t+2, and t+3 all decode f_t, but these timesteps do not run the encoder. So what is the purpose of the results obtained from this decoding?

I hope I have made my question clear. Thanks!

hutaiHang commented 10 months ago

> My point is that if time t is a key moment, and t+1, t+2, t+3 are non-key, this means that the decoders for t+1, t+2, t+3 all use the features f_t from time t. According to the parallel steps in the paper, t+1, t+2, t+3 all need to decode f_t, but these time steps do not utilize the encoder. So, what is the purpose of the results obtained from this decoding?
>
> I hope I have made my question clear, Thanks

Even though the UNet encoder is not run during non-key timesteps, the decoder still receives the shared encoder features from the preceding key timestep, and it outputs the predicted noise $\epsilon$, which is used to update $z_t$. I hope I have understood your question correctly.
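A minimal sketch of this reuse pattern, assuming hypothetical `encoder`, `decoder`, and `scheduler_step` callables (these names are illustrative, not the repository's actual API):

```python
def denoise(z, timesteps, key_steps, encoder, decoder, scheduler_step):
    """Toy sampling loop: the encoder runs only at key timesteps;
    non-key timesteps reuse the cached encoder features, while the
    decoder and the latent update run at every step."""
    cached_features = None
    for t in timesteps:
        if t in key_steps:
            cached_features = encoder(z, t)  # full encoder forward pass
        # The decoder always runs, consuming the (possibly cached) features
        # and predicting the noise epsilon for this timestep.
        eps = decoder(z, t, cached_features)
        # The scheduler uses epsilon to update z_t -> z_{t-1}.
        z = scheduler_step(z, eps, t)
    return z
```

In the paper's setting the non-key decoder calls can additionally be batched in parallel, since they all depend only on the same cached key-step features; the sketch above keeps them sequential for clarity.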

sonwe1e commented 10 months ago

Thank you for your answer; it has nicely resolved my doubts. I made a silly mistake.

sonwe1e commented 10 months ago

Thank you again for your response. I have another question. From the graph, a smaller interval in the Uniform method means fewer skipped encoder steps, which should be closer to the original diffusion process. But why, then, is the performance of I worse than that of II? (image attached)