Closed yjhong89 closed 1 week ago
According to our experiments, CFG (for text-conditioning) is quite important for video motion and quality. Could you please test the model with some simple prompts, such as "a person smiling"?
Yes, CFG is important, so you should not use the null text embedding. Our checkpoint is trained only on text-to-video generation. Can you try again with a simple text prompt?
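For context, classifier-free guidance combines the model's prediction under the text prompt with its prediction under the null/empty-text embedding, extrapolating toward the conditional one. A minimal sketch, where `model` and the embedding arguments are hypothetical placeholders (not this repo's actual API):

```python
def cfg_denoise(model, x, t, text_emb, null_emb, guidance_scale=7.5):
    # Classifier-free guidance sketch. `model` is assumed to return a
    # noise prediction (list of floats here for simplicity) given
    # latents `x`, timestep `t`, and a conditioning embedding.
    eps_cond = model(x, t, text_emb)    # conditioned on the prompt
    eps_uncond = model(x, t, null_emb)  # null / empty-text embedding
    # Extrapolate away from the unconditional prediction; larger
    # guidance_scale pushes the output closer to the text condition.
    return [u + guidance_scale * (c - u)
            for c, u in zip(eps_cond, eps_uncond)]
```

If the null embedding is passed as the "prompt" itself, the two branches coincide and the guidance term vanishes, which would explain weak motion and quality.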
Thanks for your comments.
The portrait remains happy throughout the video clip. Starts with a wide smile, raising cheeks, tightening lids, and pulling up the upper lip. It then transitions into a jaw drop and slight dimpling, maintaining the lip raise.
<Results of 768p>
https://github.com/user-attachments/assets/05a1a95c-6be5-48f8-b8d8-32484a275bec
https://github.com/user-attachments/assets/a65e2c76-cb04-4f88-b6e4-4960876e1c2d
Hi! Thanks for releasing the 768p model.
I tested I2V inference with it and noticed some artifacts and temporal-consistency issues in the generated videos.