Closed yjhong89 closed 1 week ago
According to our experiments, CFG (for text-conditioning) is quite important for video motion and quality. Could you please test the model with some simple prompts, such as "a person smiling"?
Yes, CFG is important, so you should not use the null text embedding. Our checkpoint is trained only on text-to-video generation. Can you try again with a simple text prompt?
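For context, classifier-free guidance combines the model's prediction under the text prompt with its prediction under the null/empty-text embedding, extrapolating toward the conditional one. A minimal sketch, where `model` and the embedding arguments are hypothetical placeholders (not this repo's actual API):

```python
def cfg_denoise(model, x, t, text_emb, null_emb, guidance_scale=7.5):
    # Classifier-free guidance sketch. `model` is assumed to return a
    # noise prediction (list of floats here for simplicity) given
    # latents `x`, timestep `t`, and a conditioning embedding.
    eps_cond = model(x, t, text_emb)    # conditioned on the prompt
    eps_uncond = model(x, t, null_emb)  # null / empty-text embedding
    # Extrapolate away from the unconditional prediction; larger
    # guidance_scale pushes the output closer to the text condition.
    return [u + guidance_scale * (c - u)
            for c, u in zip(eps_cond, eps_uncond)]
```

If the null embedding is passed as the "prompt" itself, the two branches coincide and the guidance term vanishes, which would explain weak motion and quality.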
Thanks for your comments.
The portrait remains happy throughout the video clip. Starts with a wide smile, raising cheeks, tightening lids, and pulling up the upper lip. It then transitions into a jaw drop and slight dimpling, maintaining the lip raise.
<Results of 768p>
https://github.com/user-attachments/assets/05a1a95c-6be5-48f8-b8d8-32484a275bec
https://github.com/user-attachments/assets/a65e2c76-cb04-4f88-b6e4-4960876e1c2d
Hi! Thanks for releasing the 768p model.
I tested I2V inference with it and noticed some artifacts and temporal-consistency issues in the generated videos.