maxin-cn / Cinemo

Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models
Apache License 2.0

video length #9

Closed cocktailpeanut closed 4 weeks ago

cocktailpeanut commented 1 month ago

how many seconds of video can this generate, and at which FPS? I did look into the code, but wanted to confirm the longest duration the model supports.

maxin-cn commented 1 month ago

It can generate 16 frames. However, Cinemo supports autoregressive generation. In theory, very long videos can be generated.

thipokKub commented 1 month ago

Sorry, but I'm a little bit confused. Given the first frame, you learn to predict the motion residuals of the remaining 15 frames using a U-Net. During the training phase n is fixed at 16 frames, and I think it's a hyperparameter (based on the training dataset). So I don't see how it can be extended autoregressively — or are you referring to chaining multiple generated videos together?

Note - I've tried changing the generated video length, but the results degraded quite fast: quality was okay at 24 frames, but beyond that the video becomes very jittery and splotchy.

maxin-cn commented 1 month ago

Hi @thipokKub , the most direct way to generate a longer video is to use the last frame of the previous video clip as the input image for the next generation. Repeat this process over and over again and you can get very long videos.
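For clarity, the chaining procedure described above can be sketched as a short loop. `generate_clip` here is a hypothetical stand-in for Cinemo's image-to-video pipeline, not its real API; the loop only shows the control flow of feeding each clip's last frame back in as the next input image.

```python
def generate_long_video(generate_clip, first_frame, prompts, clip_len=16):
    """Chain fixed-length clips into a longer video.

    generate_clip: hypothetical callable (image, prompt, num_frames) -> list of frames.
    prompts: one prompt per clip, so the motion description can change over time.
    """
    all_frames = []
    current_image = first_frame
    for prompt in prompts:
        clip = generate_clip(current_image, prompt, num_frames=clip_len)
        all_frames.extend(clip)
        current_image = clip[-1]  # last frame seeds the next clip
    return all_frames
```

Passing one prompt per clip also addresses the point raised below: the text condition can be updated between clips instead of always reusing the original prompt.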

thipokKub commented 1 month ago

I see your point, but if the motion lasts more than n frames, it will be chopped in the middle. For example, take an image of a bouncing ball where the last frame shows the ball moving up: when chaining a new generation from that frame, the model could interpret the ball as falling instead of rising. At least for chaining generated videos, I think additional context probably needs to be passed.

maxin-cn commented 1 month ago

If we want to generate a long video, I think we need to pass a different prompt for each clip, rather than always reusing the original prompt.

thipokKub commented 1 month ago

Do you think that learning the motion residual implicitly learns a temporal embedding, like an LLM's positional embedding? In the sense that it learns each frame's position relative to the first frame.

maxin-cn commented 1 month ago

Cinemo is built upon LaVie, which uses RoPE to add temporal positional embeddings. Therefore, Cinemo should not need to learn temporal embeddings.
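To illustrate why RoPE removes the need for learned temporal embeddings: it rotates each feature pair by an angle proportional to the frame index, so attention scores end up depending only on the relative offset between frames. The sketch below is a generic, minimal RoPE on a flat feature vector, not LaVie's or Cinemo's actual implementation.

```python
import math

def rope(vec, position, base=10000.0):
    """Apply rotary position embedding to one frame's feature vector.

    vec: flat list of floats with even length; position: integer frame index.
    Consecutive feature pairs are rotated by theta = position / base**(i/dim),
    the standard RoPE frequency schedule.
    """
    dim = len(vec)
    out = [0.0] * dim
    for i in range(0, dim, 2):
        theta = position / (base ** (i / dim))
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + 1]
        out[i] = x1 * c - x2 * s
        out[i + 1] = x1 * s + x2 * c
    return out
```

The key property is that the dot product of two rotated vectors depends only on the difference between their frame indices, which is exactly the relative-position behavior being discussed.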

github-actions[bot] commented 1 month ago

Hi There! 👋

This issue has been marked as stale due to inactivity for 14 days.

We would like to inquire if you still have the same problem or if it has been resolved.

If you need further assistance, please feel free to respond to this comment within the next 7 days. Otherwise, the issue will be automatically closed.

We appreciate your understanding and would like to express our gratitude for your contribution to Cinemo. Thank you for your support. 🙏