huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Any plans to add ModelScope's 1.7B text2video synthesis diffusion model? #2736

Closed kabachuha closed 1 year ago

kabachuha commented 1 year ago

Model/Pipeline/Scheduler description

Hello!

There seems to be a new 1.7B-parameter diffusion-based model by ModelScope that does text2video synthesis, as noted by AKHaliq: https://twitter.com/_akhaliq/status/1637321077553606657?s=20. Both the model implementation and the weights (downloaded with their pipeline) are openly available, and it's already possible to run it via Hugging Face Spaces. However, the model lacks many possible optimizations, especially a low-VRAM mode and accessibility options, and I believe it would benefit greatly from the help of the Diffusers community.

Example: monkey playing on drums

https://user-images.githubusercontent.com/14872007/226178634-d97b9782-a8fd-4dd1-989f-2544992a96b3.mp4

At the moment the model should fit in around 16 GB of VRAM, but since it's a combination of 4 GB, 6 GB, and 5 GB models, I believe that with half precision and a sequential pipeline it will eventually be possible to run it on modern consumer hardware.
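To make the arithmetic behind that estimate concrete, here is a rough back-of-the-envelope sketch. It assumes the 4/6/5 GB figures are full-precision (fp32) weight sizes, that half precision simply halves them, and that a sequential pipeline keeps only one sub-model on the GPU at a time. The component names and the helper function are hypothetical, and activations and framework overhead are ignored, so real numbers will be higher.

```python
# Rough weight-only VRAM estimate for the three sub-models mentioned above.
# Assumptions (not confirmed by the model authors): the sizes are fp32 weight
# sizes, and memory scales linearly with bytes per parameter.
# Component names are placeholders.
FP32_SIZES_GB = {"component_a": 4.0, "component_b": 6.0, "component_c": 5.0}


def estimate_vram_gb(sizes_gb, half_precision=False, sequential=False):
    """Return an estimated weight footprint in GB.

    half_precision: store weights in 16 bits instead of 32 (halves the size).
    sequential: keep only one component on the GPU at a time, so the peak
    is the largest component rather than the sum of all of them.
    """
    scale = 0.5 if half_precision else 1.0
    scaled = [size * scale for size in sizes_gb.values()]
    return max(scaled) if sequential else sum(scaled)


if __name__ == "__main__":
    for half in (False, True):
        for seq in (False, True):
            gb = estimate_vram_gb(FP32_SIZES_GB, half, seq)
            print(f"half_precision={half}, sequential={seq}: {gb:.1f} GB")
```

Under these assumptions the weights alone take about 15 GB when everything is loaded at once in full precision, and about 3 GB at peak with both half precision and sequential loading, which is why consumer GPUs look within reach.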

The license is Apache-2.0, so there should be no problem with using the code as a reference.

Open source status

Provide useful links for the implementation

HuggingFace space:

https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis

All the parts of the model at HuggingFace:

https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis/tree/main

The model PyTorch implementation:

https://github.com/modelscope/modelscope/tree/master/modelscope/models/multi_modal/video_synthesis

Google Colab from the devs:

https://colab.research.google.com/drive/1uW1ZqswkQ9Z9bp5Nbo5z59cAn7I0hE6R?usp=sharing

License: Apache-2.0

AK391 commented 1 year ago

model is also on huggingface: https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis/tree/main

kabachuha commented 1 year ago

I know, I linked it in the issue description

Correction: linked the space, not the model itself :)

patrickvonplaten commented 1 year ago

On it: https://github.com/huggingface/diffusers/pull/2738

Hope to have it by Wednesday/Thursday

kabachuha commented 1 year ago

Closing it now, as implemented in https://github.com/huggingface/diffusers/pull/2738