Any plans to add ModelScope's 1.7B text2video synthesis diffusion model?

kabachuha commented 1 year ago

Model/Pipeline/Scheduler description

Hello!

There seems to be a new 1.7B-parameter diffusion-based model by ModelScope that allows text2video synthesis, as noted by AKHaliq: https://twitter.com/_akhaliq/status/1637321077553606657?s=20. Both the model implementation and the weights (downloaded with their pipeline) are openly available, and it's already possible to run it via Hugging Face Spaces. However, the model is missing many possible optimizations, especially a low-VRAM mode, as well as accessibility options, and I believe it would benefit greatly from the help of the Diffusers community.

Example: monkey playing on drums

https://user-images.githubusercontent.com/14872007/226178634-d97b9782-a8fd-4dd1-989f-2544992a96b3.mp4

At this time the model should fit in around 16 GB of VRAM, but since it's a combination of 4 GB, 6 GB, and 5 GB models, I believe that with half precision and a sequential pipeline it will eventually be possible to run it on modern consumer hardware.
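A rough back-of-the-envelope sketch of that estimate, in Python. It assumes the quoted 4/6/5 GB figures are full-precision (fp32) weight sizes, which I haven't verified, and it only counts weights: activations, the CUDA context, and any other overhead are ignored. The component names and helper functions are placeholders for illustration, not names from the ModelScope code.

```python
# Sizes in GB as quoted above. ASSUMPTION: these are fp32 weight sizes.
# The keys are hypothetical labels, not the real sub-model names.
component_gb = {"component_a": 4.0, "component_b": 6.0, "component_c": 5.0}

# fp16 stores each weight in 2 bytes instead of 4, so weights take half the space.
FP16_RATIO = 0.5


def all_resident(sizes, ratio=1.0):
    """Weight memory if every sub-model stays on the GPU at once."""
    return sum(sizes.values()) * ratio


def sequential_peak(sizes, ratio=1.0):
    """Peak weight memory if only one sub-model is on the GPU at a time."""
    return max(sizes.values()) * ratio


if __name__ == "__main__":
    print("fp32, all resident:", all_resident(component_gb), "GB")
    print("fp16, all resident:", all_resident(component_gb, FP16_RATIO), "GB")
    print("fp32, sequential  :", sequential_peak(component_gb), "GB")
    print("fp16, sequential  :", sequential_peak(component_gb, FP16_RATIO), "GB")
```

Under those assumptions the weights alone would need about 15 GB as-is, about 7.5 GB in half precision, and a peak of about 3 GB with half precision plus sequential loading. Real usage will be higher once activations are included.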

The license is Apache-2.0, so there will be no problems with using the code as a reference.