huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

[Model + Pipeline] DragNUWA #6497

Closed a-r-r-o-w closed 1 month ago

a-r-r-o-w commented 9 months ago

Model/Pipeline/Scheduler description

DragNUWA enables users to manipulate backgrounds or objects within images directly, and the model seamlessly translates these actions into camera movements or object motions, generating the corresponding video.

[example attachment]

Thank you for your amazing and absolutely mind-blowing work at Microsoft Research once again! Can't wait to get into the specifics and learn from your paper :heart:

Code: https://github.com/ProjectNUWA/DragNUWA
Paper: https://arxiv.org/abs/2308.08089
Project Page: https://www.microsoft.com/en-us/research/project/dragnuwa/
Demo: https://huggingface.co/spaces/yinsming/DragNUWA
Authors: @shengming-yin @moymix @tim-learn [Jie Shi] [Houqiang Li] [Gong Ming] @nanduan

Open source status

Provide useful links for the implementation

Regarding implementation: the codebase is built on the SVD backbone. Diffusers probably has the most intuitive implementation of SVD, so adding this should hopefully not be too difficult.
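For context, this is roughly what the existing SVD image-to-video entry point in diffusers looks like today (the local image path is just a placeholder); a DragNUWA pipeline would add drag/trajectory conditioning on top of this same backbone:

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Plain SVD image-to-video in diffusers; a DragNUWA pipeline would extend this
# with drag/trajectory conditioning on the same spatio-temporal UNet.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

image = load_image("input.png").resize((1024, 576))  # placeholder input image
frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```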

@sayakpaul @patrickvonplaten

sayakpaul commented 9 months ago

Very cool. You know the drill by now :D

Feel free to open a PR for community examples.

a-r-r-o-w commented 8 months ago

Apologies for the delay here. I've been working on my first ComfyUI extension for this but found it slightly difficult to decipher the codebase and add it as a new extension. It seems someone already made a really nice extension recently and beat me to it: link! I will focus on converting this to a diffusers-format pipeline now.

It might be interesting to compare DragNUWA against MotionCtrl since both work with the SVD backbone, although the latter also supports VideoCrafter and AnimateDiff as backbones. This feature is comparable to the Multi-Motion Brush product provided by RunwayML, which I believe uses something similar under the hood.

a-r-r-o-w commented 8 months ago

@sayakpaul @patil-suraj @patrickvonplaten I need some help converting the SVD checkpoint they provide to the diffusers format. I see that we have a script for the conversion, but it is not very straightforward to use because it does not expose a CLI, and I've been having difficulties initiating the conversion by using the code directly. Any pointers on how to go about it would be really helpful, thanks!

sayakpaul commented 8 months ago

What problems are you facing exactly when using the conversion script?

It's better to share checkpoints on the Hub rather than Drive :D Why cloud your precious storage space? :D

a-r-r-o-w commented 8 months ago

> What problems are you facing exactly when using the conversion script?

The script does not expose a CLI (which maybe I can take up in a PR) for easy conversion of weights. As more SVD checkpoints appear (MotionCtrl, DragNUWA), it would be a nice, easy way to get things ready for testing.
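To illustrate the kind of CLI I have in mind, here is a rough sketch; `convert_svd_checkpoint` is a hypothetical name standing in for whatever conversion function the existing script provides, and the argument names are only suggestions:

```python
# Hypothetical sketch of a thin argparse wrapper around the existing
# conversion code; `convert_svd_checkpoint` stands in for the script's
# actual conversion entry point.
import argparse


def main():
    parser = argparse.ArgumentParser(
        description="Convert an original SGM-format SVD checkpoint to diffusers."
    )
    parser.add_argument("--checkpoint_path", type=str, required=True,
                        help="Path to the original .ckpt/.safetensors checkpoint.")
    parser.add_argument("--original_config_file", type=str, required=True,
                        help="Path to the original SGM YAML config.")
    parser.add_argument("--dump_path", type=str, required=True,
                        help="Output directory for the diffusers-format pipeline.")
    parser.add_argument("--push_to_hub", action="store_true",
                        help="Push the converted pipeline to the Hub.")
    args = parser.parse_args()

    # Hypothetical call; reuse the script's existing conversion logic here.
    pipeline = convert_svd_checkpoint(
        checkpoint_path=args.checkpoint_path,
        original_config_file=args.original_config_file,
    )
    pipeline.save_pretrained(args.dump_path, push_to_hub=args.push_to_hub)


if __name__ == "__main__":
    main()
```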

The script also does not seem to work directly when the original YAML-format config from the SGM implementation is loaded as a dict. Converting the dict into Python objects (which is what the script expects, since it accesses attributes with dot notation in places) still seems to give errors. I will spend some time improving it.
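Concretely, the dot-notation mismatch can be avoided by loading the YAML with OmegaConf (which the SGM codebase itself uses) instead of a plain dict; a minimal sketch, with a placeholder config path:

```python
from omegaconf import OmegaConf

# Loading the original SGM YAML with OmegaConf gives attribute-style access
# (config.model.params...), which is what the conversion code expects,
# rather than a plain dict that only supports config["model"]["params"].
config = OmegaConf.load("svd.yaml")  # placeholder path to the original config
print(config.model.target)           # dot-notation access works

# Convert back to a plain dict if some part of the script needs one.
plain_dict = OmegaConf.to_container(config, resolve=True)
```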

> It's better to share checkpoints on the Hub rather than Drive :D Why cloud your precious storage space? :D

The weights were uploaded by the authors :) Researchers should really start using HF to store weights instead, since Drive keeps erroring out when too many people access these large files and it blocks downloads :laughing: Fortunately, people have downloaded and pushed to hub, but it is not in diffusers format.

From the implementation perspective, I've spent time diving into both DragNUWA and MotionCtrl and understand most of the paper and code. I also feel somewhat confident about being able to implement a training script for both. So far, from testing with the original codebases, MotionCtrl seems to be better than DragNUWA at object consistency, so I will prioritize that. I will open PRs shortly once I can get the weights converted (both support SVD). The changes should be minimal and self-contained in the pipelines, but they will require modifying the spatio-temporal UNet code due to the added conditioning from the proposed camera and object modules.
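To give a rough idea of the kind of UNet change I mean (all names below are made up for illustration and are not from either codebase): the drag/camera inputs would be encoded into feature maps and injected alongside the existing conditioning, similar in spirit to how ControlNet residuals are added:

```python
import torch
import torch.nn as nn


class FlowConditioningEncoder(nn.Module):
    """Hypothetical encoder mapping per-frame drag/flow maps to feature maps
    that could be added to the UNet's early down-block hidden states
    (illustrative sketch only, not code from DragNUWA or MotionCtrl)."""

    def __init__(self, in_channels: int = 2, out_channels: int = 320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, flow: torch.Tensor) -> torch.Tensor:
        # flow: (batch * num_frames, 2, H, W) sparse drag / optical-flow maps
        return self.net(flow)
```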

sayakpaul commented 8 months ago

Fortunately, people have downloaded and pushed to hub, but it is not in diffusers format.

Yeah please do provide the relevant link.

Let's maybe start from a community pipeline first.

Let's maybe hold off a bit on the training script, as SVD is not commercially friendly, unlike SDXL.

For the conversion script, I will defer to @patil-suraj and @DN6.

a-r-r-o-w commented 8 months ago

Sure. Here are the ones for DragNUWA I found uploaded to the hub:

Also, MotionCtrl (which also requires conversion): https://huggingface.co/TencentARC/MotionCtrl

Maybe it also makes sense to add support for .from_single_file via the single-file mixin.
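For reference, this is the kind of usage the single-file mixin already enables for SD/SDXL checkpoints; whether the SVD-based pipelines here can support the same pattern for original-format DragNUWA/MotionCtrl checkpoints is exactly the open question:

```python
from diffusers import StableDiffusionXLPipeline

# Existing from_single_file usage for an original-format SDXL checkpoint;
# the goal would be analogous loading of original SVD/DragNUWA checkpoints.
pipe = StableDiffusionXLPipeline.from_single_file(
    "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors"
)
```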

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.