showlab / Show-1

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
https://showlab.github.io/Show-1/

The final resolution of the LDM-based super-resolution. #1

Closed Little-Podi closed 9 months ago

Little-Podi commented 9 months ago

Happy mid-autumn festival and congratulations on your insightful work. I have a minor question about the resolution. The pixel-based VDM generates a video with frame size $256\times160$, while the LDM upsamples the video frames to $576\times320$. Does this mean the aspect ratio is changed? I am just curious why the LDM does not upsample to $512\times320$.
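(For reference, a quick worked comparison of the aspect ratios behind this question; the snippet below is purely illustrative and not from the Show-1 codebase.)

```python
# Hypothetical illustration: aspect ratios of the stages discussed in this issue.
stages = {
    "pixel VDM output": (256, 160),
    "LDM super-resolution output": (576, 320),
    "hypothetical 512x320 output": (512, 320),
}

for name, (w, h) in stages.items():
    print(f"{name}: {w}x{h}, aspect ratio = {w / h:.2f}")

# pixel VDM output: 256x160, aspect ratio = 1.60
# LDM super-resolution output: 576x320, aspect ratio = 1.80
# hypothetical 512x320 output: 512x320, aspect ratio = 1.60
```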

junhaozhang98 commented 9 months ago

Thanks for your question, and happy mid-autumn festival! We adhere to the output resolution used by ZeroScope to ensure easy and unbiased comparisons. Additionally, the LDM's upsampling can adapt well to minor aspect-ratio variations.

Moreover, finetuning the LDM for a slight ratio or size change is quite fast (about 4,000-6,000 steps).
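(A minimal sketch of what "adapting to a minor ratio variation" can look like in practice, assuming a PyTorch-style pipeline where the low-resolution frames are resized to the target grid before conditioning the latent upsampler; this is illustrative and not the Show-1 implementation.)

```python
import torch
import torch.nn.functional as F

# Hypothetical example: stretch 256x160 frames to the 576x320 target grid,
# absorbing the small aspect-ratio change (1.60 -> 1.80) via bilinear resizing.
low_res = torch.randn(1, 3, 8, 160, 256)  # (batch, channels, frames, H, W)
b, c, t, h, w = low_res.shape

frames = low_res.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)  # fold frames into batch
frames = F.interpolate(frames, size=(320, 576), mode="bilinear", align_corners=False)
resized = frames.reshape(b, t, c, 320, 576).permute(0, 2, 1, 3, 4)

print(resized.shape)  # torch.Size([1, 3, 8, 320, 576])
```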

Little-Podi commented 9 months ago

I see. Thanks for your detailed answer.