huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and Flax.
https://huggingface.co/docs/diffusers
Apache License 2.0
25.41k stars · 5.26k forks

Video generation with stable diffusion #1962

Open feizc opened 1 year ago

feizc commented 1 year ago

Model/Pipeline/Scheduler description

Hey,

Thanks for sharing.

Please check out my modified version of video generation with stable diffusion: https://github.com/feizc/Video-Stable-Diffusion

Open source status

Provide useful links for the implementation

https://github.com/feizc/Video-Stable-Diffusion
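For readers curious how a frame-wise approach like this works at a high level, here is a minimal, dependency-light sketch. The `img2img_step` stub below is a hypothetical stand-in for a real Stable Diffusion img2img call (it only mimics the noise-then-denoise shape of the operation); the function names are illustrative and not taken from the linked repository.

```python
import numpy as np

def img2img_step(frame: np.ndarray, strength: float,
                 rng: np.random.Generator) -> np.ndarray:
    """Hypothetical stand-in for a Stable Diffusion img2img call:
    partially noise the frame, then blend back toward the original.
    A real pipeline would denoise conditioned on a text prompt."""
    noise = rng.standard_normal(frame.shape)
    return (1.0 - strength) * frame + strength * noise

def stylize_video(frames, strength: float = 0.4, seed: int = 0):
    """Apply the same img2img transform to every frame.
    A low `strength` keeps each output close to its source frame,
    which is what gives frame-wise methods their temporal coherence."""
    rng = np.random.default_rng(seed)
    return [img2img_step(f, strength, rng) for f in frames]

# Toy 8-frame, 64x64 RGB "video" of zeros.
video = [np.zeros((64, 64, 3)) for _ in range(8)]
out = stylize_video(video)
print(len(out), out[0].shape)
```

The key design point the sketch illustrates: each frame is processed independently with a shared seed and a moderate denoising strength, so consistency comes from staying close to the source frames rather than from any cross-frame attention.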

patrickvonplaten commented 1 year ago

I think this could make a cool community pipeline. If anybody is interested in opening a PR for a community pipeline: https://github.com/huggingface/diffusers/issues/841

aandyw commented 1 year ago

I'd be interested in taking this up if no one has taken it yet.

patrickvonplaten commented 1 year ago

This would be very nice @Pie31415 :heart_eyes:

basab-gupta commented 1 year ago

I hope it's not too late, but I would love to hop onto this as well if it's okay?

aandyw commented 1 year ago

@basab-gupta Sorry, been having some trouble adapting the pipeline. I'll try to get a draft up soon.

aandyw commented 1 year ago

@feizc @patrickvonplaten Pipeline implemented. Let me know if you have any feedback or suggestions for the implementation.

aandyw commented 1 year ago

> I hope it's not too late, but I would love to hop onto this as well if it's okay?

@basab-gupta Would you like to take over this pipeline implementation? I'm not sure I'll have enough time to figure out how to rework everything.

patrickvonplaten commented 1 year ago

@basab-gupta in case you have time, feel free to give this implementation a try :-)

zhouliang-yu commented 1 year ago

I have a question related to video generation. Is there any off-the-shelf video generation model that can do this: given a text prompt and the first frame of a video, the model generates the future frames? For example, given a picture of a kitchen and the text prompt "make me a chicken soup", the model takes the visual and text signals and generates a video of making chicken soup, based on the first frame we provided.

silvererudite commented 1 year ago

@patrickvonplaten With this new addition to diffusers, https://huggingface.co/docs/diffusers/main/en/api/pipelines/text_to_video, does this solve the needs of this issue? If there's any other way to contribute, I'd love to know.

patrickvonplaten commented 1 year ago

Hey @silvererudite ,

Yes, I think the new text-to-video model is probably a bit more powerful than the one proposed here. But there are lots of other ways to contribute! Could you maybe check: https://github.com/huggingface/diffusers/blob/main/CONTRIBUTING.md ? :-)

aandyw commented 1 year ago

@patrickvonplaten Should this issue be closed then if there is already an existing pipeline?

patrickvonplaten commented 1 year ago

Yes, I'll close it - hope that's ok/understandable for everybody!