seervideodiffusion / SeerVideoLDM

[ICLR 2024] Seer: Language Instructed Video Prediction with Latent Diffusion Models

Inquiry About Direct Image Generation and Pre-trained Models #3

Open freshcoffee22 opened 3 months ago

freshcoffee22 commented 3 months ago

Are you able to use this model to generate the next frame image directly? Also, do you happen to have any pre-trained models available?

XianfanGu commented 3 months ago

Hi, the proposed model is limited to generating videos of a fixed length (12 or 16 frames). I suggest sampling a fixed-length video, picking the next frame, and repeating this iteratively. We did test longer generation in the paper (up to 22 frames), but the results show repeated motion, since the model can only understand a single-step video motion (e.g., "pick up something"). We have released all pre-trained models mentioned in the main experimental results of the paper. I hope this answer helps.

freshcoffee22 commented 3 months ago

Dear Mr. Gu, I hope you are doing great! Thank you very much for your reply and guidance. Actually, we don't need to generate 12, 16, or 22 frames. Our need is quite straightforward: we only need to generate/predict a single next frame. Specifically, we aim to predict the next frame given an input video clip. Could you kindly advise us on how to use your model to predict the next frame conditioned on the previous video? Thank you so much; we really appreciate your help! I would also like to extend my heartfelt appreciation for your remarkable open-source video generation project and its highly commendable results. Your project has made a substantial impact in the field of video synthesis, and I sincerely congratulate you on your diligent work and the exceptional contributions it has brought to this domain!

XianfanGu commented 2 months ago

Thank you for your response. Unfortunately, our proposed model cannot operate in an autoregressive manner to predict only the single next frame; the diffusion model generates whole video clips aligned with the global text prompt. If you need to generate frames autoregressively, you still have to sample a fixed-length video and then pick the next frame at each iteration, which incurs a higher computational cost than a true autoregressive model.
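
For reference, a minimal sketch of the "sample a clip, keep one frame" loop described above. The `pipeline` object, `load`-style setup, and `sample` signature are hypothetical placeholders, not the actual SeerVideoLDM API:

```python
# Hypothetical sketch of the iterative next-frame loop suggested above.
# `pipeline.sample` is a placeholder for whatever entry point the released
# SeerVideoLDM checkpoints expose; adapt it to the repository's real API.
import torch

def predict_next_frame(pipeline, context_frames, prompt, clip_len=12):
    """Sample a fixed-length clip conditioned on the context frames and
    text, then return only the first newly generated frame."""
    with torch.no_grad():
        # The model always generates a full fixed-length clip.
        clip = pipeline.sample(
            cond_frames=context_frames,  # (T_cond, C, H, W) conditioning frames
            prompt=prompt,
            num_frames=clip_len,
        )
    # Keep only the frame immediately after the conditioning frames,
    # assuming the returned clip includes them as its first T_cond frames.
    return clip[context_frames.shape[0]]

def rollout(pipeline, context_frames, prompt, num_new_frames):
    """Autoregressive rollout: repeatedly sample a whole clip and keep
    one frame, so each new frame costs a full diffusion sampling pass."""
    frames = list(context_frames)
    for _ in range(num_new_frames):
        context = torch.stack(frames[-2:])  # e.g., condition on the last 2 frames
        frames.append(predict_next_frame(pipeline, context, prompt))
    return torch.stack(frames)
```

As the reply notes, each new frame here requires diffusion sampling of an entire fixed-length clip, which is why this loop is considerably more expensive per frame than a model designed for autoregressive prediction.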