jy0205 / Pyramid-Flow

Code of Pyramidal Flow Matching for Efficient Video Generative Modeling
https://pyramid-flow.github.io/
MIT License
2.43k stars 238 forks

flux 768p result #177

Closed yjhong89 closed 1 week ago

yjhong89 commented 1 week ago

Hi! Thanks for releasing the 768p model.

I tested I2V inference with this model and noticed some artifacts and temporal consistency issues in the generated videos.

https://github.com/user-attachments/assets/40db3853-b987-4d24-9f69-506a8e6ccfc4

https://github.com/user-attachments/assets/3a51fbf5-3bda-432c-85e3-e67d11f7c4fe

https://github.com/user-attachments/assets/2318eb16-5571-4ee9-8d5b-2eb84e99e167

https://github.com/user-attachments/assets/335908a3-fada-4d76-9ac5-2311366ebe12

- As you can see, the video becomes visibly blurrier as generation goes on, and subject consistency also tends to break.
- Why does this happen?
- Since the 768p model was trained at 768x1280, RoPE doesn't handle the unseen width/height (1024x1024) well.
- Although no text prompt was given at all, I think the text doesn't matter, because the 384p model doesn't show this behavior (inference at 512x512).
- What do you think?

<384p samples>

https://github.com/user-attachments/assets/0c8c75e2-1cdb-486a-8083-a5886e0060b7

https://github.com/user-attachments/assets/da43a847-bd6e-40c5-87d8-b85cfc2fc126

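To make the RoPE hypothesis concrete, here is a rough sketch that counts how many token positions per axis fall outside the range seen in training. The helper names, the VAE stride of 8, and the patch size of 2 are assumptions for illustration and are not taken from the Pyramid-Flow code; only the 768x1280 and 1024x1024 sizes come from this thread.

```python
def token_grid(height, width, vae_stride=8, patch_size=2):
    """Return the (rows, cols) token grid for an image size.

    vae_stride and patch_size are assumed values, not read from the repo.
    """
    step = vae_stride * patch_size
    return height // step, width // step


def unseen_positions(train_hw, test_hw, **kwargs):
    """Count positions per axis that exceed the training grid."""
    train_rows, train_cols = token_grid(*train_hw, **kwargs)
    test_rows, test_cols = token_grid(*test_hw, **kwargs)
    return max(0, test_rows - train_rows), max(0, test_cols - train_cols)


# 768x1280 training size vs. 1024x1024 inference size (height, width).
extra_rows, extra_cols = unseen_positions((768, 1280), (1024, 1024))
print(extra_rows, extra_cols)
```

Under these assumptions, only the height axis goes past the trained range, so this would be one possible contributor rather than a confirmed cause.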
feifeiobama commented 1 week ago

According to our experiments, CFG (for text-conditioning) is quite important for video motion and quality. Could you please test the model with some simple prompts, such as "a person smiling"?

jy0205 commented 1 week ago

> According to our experiments, CFG (for text-conditioning) is quite important for video motion and quality. Could you please test the model with some simple prompts, such as "a person smiling"?

Yes, the CFG is important. So I guess you should not use the null text embedding. Our checkpoint is only trained on text-to-video generation. Can you try again by using a simple text prompt?
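For readers unfamiliar with CFG, a minimal sketch of the standard classifier-free guidance blend follows. The function name `apply_cfg` and the toy values are hypothetical, and the exact form used in the Pyramid-Flow pipeline may differ. It shows that when the prompt is the null text, the conditional and unconditional predictions coincide and the guidance term vanishes.

```python
def apply_cfg(cond, uncond, guidance_scale):
    """Standard CFG blend: uncond + scale * (cond - uncond), element-wise."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


# Toy predictions standing in for model outputs (hypothetical values).
uncond = [0.5, 1.0]
cond_with_prompt = [1.0, 2.0]

# With a real prompt, the guidance scale amplifies the text-driven direction.
print(apply_cfg(cond_with_prompt, uncond, 7.0))

# With a null prompt, cond == uncond, so the scale has no effect at all.
print(apply_cfg(uncond, uncond, 7.0))
```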

yjhong89 commented 1 week ago

Thanks for your comments.

<Results of 768p>

https://github.com/user-attachments/assets/05a1a95c-6be5-48f8-b8d8-32484a275bec

https://github.com/user-attachments/assets/a65e2c76-cb04-4f88-b6e4-4960876e1c2d