jjihwan / FIFO-Diffusion_public

Official implementation of FIFO-Diffusion: Generating Infinite Videos from Text without Training (NeurIPS 2024)
https://jjihwan.github.io

Question about the trade-off between quality and speed #2

Closed Dorniwang closed 6 months ago

Dorniwang commented 6 months ago

From Sec. 4.2 of the paper, it seems that latent partitioning, which improves quality by reducing the gap between training and inference, also increases the number of denoising steps. Does that mean we need multiple GPUs to speed it up, or otherwise just accept a slower inference process?

jjihwan commented 6 months ago

Yes, since latent partitioning requires more computation per generated frame than vanilla diagonal denoising, you can either use multiple GPUs or accept slower inference. However, latent partitioning with n=4 uses 64 inference steps (16×4), which is no slower than the original video diffusion models (they often use 50 to 150 steps for inference). In fact, it is much faster than VDMs when using multiple GPUs.
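
For concreteness, here is a minimal back-of-the-envelope sketch of the arithmetic in the reply above. The function and parameter names (`total_steps`, `frames_per_block`, `num_partitions`, `num_gpus`) are illustrative, not taken from this repository's code; the values 16 and 4 come from the 16×4 figure quoted above.

```python
import math

# Assumption (illustrative, not from the repo): the FIFO queue holds
# frames_per_block * num_partitions latents, and each frame receives one
# denoising step per iteration it spends in the queue before dequeuing.

def total_steps(frames_per_block: int = 16, num_partitions: int = 4) -> int:
    """Denoising steps each frame undergoes before it is dequeued."""
    return frames_per_block * num_partitions  # 16 * 4 = 64

def model_calls_per_frame(num_partitions: int = 4, num_gpus: int = 1) -> int:
    """Model forward passes on the critical path per output frame:
    one call per partition, run concurrently when GPUs are available."""
    return math.ceil(num_partitions / max(num_gpus, 1))

print(total_steps())                      # 64, vs. ~50-150 steps for typical VDMs
print(model_calls_per_frame(num_gpus=1))  # 4 sequential calls per frame on one GPU
print(model_calls_per_frame(num_gpus=4))  # 1 call's wall time per frame with 4 GPUs
```

So on a single GPU the per-frame cost grows with the number of partitions, while with n GPUs the n partitions can be denoised in parallel, which is where the speedup over VDMs comes from.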

Dorniwang commented 6 months ago

Got it, thanks.