huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Add MotionCtrl #6688

Closed. DN6 closed this issue 1 month ago.

DN6 commented 9 months ago

Model/Pipeline/Scheduler description

The MotionCtrl model seems like a promising way to control movement in video generation models. According to the project page, it can work with both SVD and AnimateDiff.

It would be nice to get an idea of how we might incorporate this model into diffusers.

GitHub: https://github.com/TencentARC/MotionCtrl
HF checkpoint: https://huggingface.co/TencentARC/MotionCtrl

Open source status

Provide useful links for the implementation

No response

a-r-r-o-w commented 8 months ago
Paper highlights/notes

### Motivation

Motion in a video primarily consists of camera movement and object movement. Many existing works focus on one type of motion and do not clearly distinguish between the two. MotionCtrl presents a method for effectively controlling both, enabling flexible and diverse combinations of the two types of motion.

### Highlights

- There are two primary kinds of motion in a video: camera motion (zoom in/out, panning, etc.) and object motion.
- The authors propose a method for effectively controlling both kinds of motion, which many previous methods fail to distinguish between.
- The motion conditions are determined by camera poses and trajectories and have minimal impact on the appearance of objects in the generated video.
- The method can be applied to many existing video generation techniques (Stable Video Diffusion, AnimateDiff, VideoCrafter, etc.), i.e. it is very generalizable.
- Two trainable modules are introduced: the Camera Motion Control Module (CMCM) and the Object Motion Control Module (OMCM).
- RealEstate10K is augmented using BLIP-2 to obtain a video dataset for CMCM containing captions and camera pose annotations.
- WebVid is augmented with object motions synthesized by the motion segmentation algorithm introduced in [ParticleSfM][].
- Training is only performed on the above-mentioned modules; the weights of the pretrained diffusion models are kept frozen. It can be thought of as a ControlNet-like adapter for motion.

### Method

**CMCM:**

- Interacts with the temporal transformer layers in the UNet.
- Consists of several fully connected layers whose outputs are used to condition the temporal layers. They only influence the second attention layer (see [this](https://github.com/huggingface/diffusers/blob/8581d9bce43ac2747199f848a9ac861352495d10/src/diffusers/models/attention.py#L456)) so as not to impact the generative knowledge/performance too much.
- Takes as input a list of camera poses $\text{RT} = [\text{RT}_{0}, \text{RT}_{1}, ..., \text{RT}_{L-1}]$, where $L$ is the length of the video to be generated.
- Each camera pose can be described using 12 values (9 for camera rotation and 3 for translation). This is, typically, how locations of 3D objects are represented in graphics programming and computer vision; see [Camera Matrix][] for more details. This means the camera pose input is a tensor of size $L \times 12$.
- The pose vectors are repeated $H \times W$ times, where $H$ and $W$ are the height and width, respectively, of the video to be generated. Note that these dimensions are the respective input dimensions of the UNet sublayers, i.e., the repetitions vary at each sublayer (yet to confirm, but it seems like it). After repetition, the effective size of the pose vectors is $[H \times W, L, 12]$.
- At every sublayer, the "regular" latent inputs (of shape $[L, H, W, C]$, reshaped to $[H \times W, L, C]$) are concatenated with the new conditioning from the poses, resulting in a tensor of shape $[H \times W, L, C + 12]$. This is projected through a regular, trainable fully connected layer to reduce the dimensionality back to that of the original inputs, $[H \times W, L, C]$. The outputs of this FC layer are fed into the above-mentioned second attention layer. A rough code sketch of this conditioning path follows these notes.

**OMCM:**

- Interacts with the spatial convolutional layers in the UNet. A list of values $(x_i, y_i)$ can be provided that determines the object motion trajectory.
- Similar to [Controlnet][] and [T2I-Adapter][], trajectory conditioning is only applied to the downscale/encoder layers of the UNet so as to achieve a balance between controllability and generative quality.
- TODO: not provided in SVD, so look into it later.
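Below is a minimal PyTorch sketch of the CMCM conditioning path described in the notes above, assuming latents of shape $[L, C, H, W]$ and one flattened 3x4 camera pose per frame; the class and attribute names (`CameraPoseConditioning`, `cc_projection`) are illustrative rather than the exact MotionCtrl or diffusers implementation.

```python
import torch
import torch.nn as nn

# Sketch of the CMCM conditioning path: repeat the per-frame camera poses over
# spatial positions, concatenate with the reshaped latents, and project back to
# the original channel dimension before the second temporal attention layer.
class CameraPoseConditioning(nn.Module):
    def __init__(self, channels: int, pose_dim: int = 12):
        super().__init__()
        # trainable FC layer mapping [C + 12] back to [C]
        self.cc_projection = nn.Linear(channels + pose_dim, channels)

    def forward(self, hidden_states: torch.Tensor, camera_poses: torch.Tensor) -> torch.Tensor:
        # hidden_states: [L, C, H, W] latent features at the current UNet sublayer
        # camera_poses:  [L, 12] one flattened 3x4 camera pose per frame
        num_frames, channels, height, width = hidden_states.shape

        # reshape latents so spatial positions form the batch axis: [H*W, L, C]
        hidden_states = hidden_states.permute(2, 3, 0, 1).reshape(height * width, num_frames, channels)

        # repeat the pose sequence H*W times: [H*W, L, 12]
        camera_poses = camera_poses[None].expand(height * width, -1, -1)

        # concatenate along channels ([H*W, L, C + 12]) and project back to [H*W, L, C]
        hidden_states = self.cc_projection(torch.cat([hidden_states, camera_poses], dim=-1))

        # these features would then feed the second (attn2) temporal attention layer
        return hidden_states


# usage sketch: 14-frame latents at a 40x72 sublayer resolution with 320 channels
module = CameraPoseConditioning(channels=320)
out = module(torch.randn(14, 320, 40, 72), torch.randn(14, 12))  # -> [40*72, 14, 320]
```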

From an implementation point of view, the authors list VideoCrafter, Stable Video Diffusion, and AnimateDiff on their official project page as the currently supported techniques. Of these, diffusers currently supports Stable Video Diffusion and AnimateDiff.

We'd also need a conversion script to extract SVD and CMCM layers from the provided checkpoint.

Do let me know your thoughts and whether I missed something. I believe most of the sampling code remains the same, or will only require minimal changes, for SVD. If the pipeline is to be generally usable, the 4x3 matrix notation for camera orientation/translation might be difficult to understand; maybe we can rethink how it is provided as input and handle the necessary conversion to the expected format internally. That said, I'd be happy to take a stab at this along with DragNUWA, and am open to any help from other interested contributors :)
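As an illustration of that idea, a hypothetical helper (not part of diffusers or MotionCtrl; `pose_to_vector` is an invented name) could flatten a rotation matrix and translation into the 12-value pose expected per frame, keeping the 3x4 matrix handling internal:

```python
import torch

# Hypothetical helper: convert a friendlier camera description (rotation matrix +
# translation vector) into the flattened 3x4 pose (9 rotation + 3 translation values).
def pose_to_vector(rotation: torch.Tensor, translation: torch.Tensor) -> torch.Tensor:
    # rotation: [3, 3] camera rotation matrix, translation: [3] camera translation
    rt = torch.cat([rotation, translation.reshape(3, 1)], dim=1)  # [3, 4] camera matrix
    return rt.flatten()  # 12 values for this frame

# e.g. a static camera over a 14-frame video -> pose tensor of shape [14, 12]
num_frames = 14
poses = torch.stack([pose_to_vector(torch.eye(3), torch.zeros(3)) for _ in range(num_frames)])
```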

a-r-r-o-w commented 8 months ago

MotionCtrl-specific parameters: the only trainable layers were the CMCM layers (a few linear projection layers per block), attn2, and norm2. I've been having trouble converting the checkpoint to the diffusers format but haven't had the time to fix the conversion script yet. Having this list at least lets me use the existing SVD model and just replace the weights of these layers.
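For reference, a rough sketch of that weight-replacement approach, assuming the MotionCtrl checkpoint has already been remapped to diffusers key names (the file name below is hypothetical); note that a stock SVD UNet has no `cc_projection` modules, so those keys would only load into a suitably modified UNet class:

```python
import torch
from diffusers import UNetSpatioTemporalConditionModel

# Load the stock SVD UNet and overwrite only the MotionCtrl-specific layers.
unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid", subfolder="unet"
)

# Hypothetical file containing the converted (diffusers-named) MotionCtrl weights.
motionctrl_state_dict = torch.load("motionctrl_svd_converted.pt", map_location="cpu")

trainable_substrings = ("attn2", "norm2", "cc_projection")
filtered = {
    key: value
    for key, value in motionctrl_state_dict.items()
    if any(substring in key for substring in trainable_substrings)
}

# strict=False: keys the stock UNet does not define (e.g. cc_projection.*) are
# reported in `unexpected` rather than loaded; they require a modified UNet class.
missing, unexpected = unet.load_state_dict(filtered, strict=False)
print(f"replaced {len(filtered) - len(unexpected)} tensors, skipped {len(unexpected)}")
```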

a-r-r-o-w commented 6 months ago

Similar work, but for AnimateDiff: https://github.com/hehao13/CameraCtrl/. They also plan to release a version for SVD trained on RealEstate10K. It uses a camera trajectory encoder rather than the extra 12 channels used in MotionCtrl's SVD variant. They do have a diffusers pipeline and related modelling code, so if this is of interest with regard to AnimateDiff, I'm happy to help integrate it. cc @DN6 @sayakpaul @yiyixuxu

Project page: https://hehao13.github.io/projects-CameraCtrl/
Paper: https://arxiv.org/abs/2404.02101

jhj7905 commented 6 months ago

> MotionCtrl-specific parameters: the only trainable layers were the CMCM layers (a few linear projection layers per block), attn2, and norm2. I've been having trouble converting the checkpoint to the diffusers format but haven't had the time to fix the conversion script yet. Having this list at least lets me use the existing SVD model and just replace the weights of these layers.

@a-r-r-o-w Can you share the list of trainable layers?

jhj7905 commented 6 months ago

@a-r-r-o-w Are these the correct trainable parameters?

down_blocks.0.attentions.0.temporal_transformer_blocks.0.norm2.weight down_blocks.0.attentions.0.temporal_transformer_blocks.0.norm2.bias down_blocks.0.attentions.0.temporal_transformer_blocks.0.attn2.to_q.weight down_blocks.0.attentions.0.temporal_transformer_blocks.0.attn2.to_k.weight down_blocks.0.attentions.0.temporal_transformer_blocks.0.attn2.to_v.weight down_blocks.0.attentions.0.temporal_transformer_blocks.0.attn2.to_out.0.weight down_blocks.0.attentions.0.temporal_transformer_blocks.0.attn2.to_out.0.bias down_blocks.0.attentions.0.temporal_transformer_blocks.0.cc_projection.weight down_blocks.0.attentions.0.temporal_transformer_blocks.0.cc_projection.bias down_blocks.0.attentions.1.temporal_transformer_blocks.0.norm2.weight down_blocks.0.attentions.1.temporal_transformer_blocks.0.norm2.bias down_blocks.0.attentions.1.temporal_transformer_blocks.0.attn2.to_q.weight down_blocks.0.attentions.1.temporal_transformer_blocks.0.attn2.to_k.weight down_blocks.0.attentions.1.temporal_transformer_blocks.0.attn2.to_v.weight down_blocks.0.attentions.1.temporal_transformer_blocks.0.attn2.to_out.0.weight down_blocks.0.attentions.1.temporal_transformer_blocks.0.attn2.to_out.0.bias down_blocks.0.attentions.1.temporal_transformer_blocks.0.cc_projection.weight down_blocks.0.attentions.1.temporal_transformer_blocks.0.cc_projection.bias down_blocks.1.attentions.0.temporal_transformer_blocks.0.norm2.weight down_blocks.1.attentions.0.temporal_transformer_blocks.0.norm2.bias down_blocks.1.attentions.0.temporal_transformer_blocks.0.attn2.to_q.weight down_blocks.1.attentions.0.temporal_transformer_blocks.0.attn2.to_k.weight down_blocks.1.attentions.0.temporal_transformer_blocks.0.attn2.to_v.weight down_blocks.1.attentions.0.temporal_transformer_blocks.0.attn2.to_out.0.weight down_blocks.1.attentions.0.temporal_transformer_blocks.0.attn2.to_out.0.bias down_blocks.1.attentions.0.temporal_transformer_blocks.0.cc_projection.weight down_blocks.1.attentions.0.temporal_transformer_blocks.0.cc_projection.bias down_blocks.1.attentions.1.temporal_transformer_blocks.0.norm2.weight down_blocks.1.attentions.1.temporal_transformer_blocks.0.norm2.bias down_blocks.1.attentions.1.temporal_transformer_blocks.0.attn2.to_q.weight down_blocks.1.attentions.1.temporal_transformer_blocks.0.attn2.to_k.weight down_blocks.1.attentions.1.temporal_transformer_blocks.0.attn2.to_v.weight down_blocks.1.attentions.1.temporal_transformer_blocks.0.attn2.to_out.0.weight down_blocks.1.attentions.1.temporal_transformer_blocks.0.attn2.to_out.0.bias down_blocks.1.attentions.1.temporal_transformer_blocks.0.cc_projection.weight down_blocks.1.attentions.1.temporal_transformer_blocks.0.cc_projection.bias down_blocks.2.attentions.0.temporal_transformer_blocks.0.norm2.weight down_blocks.2.attentions.0.temporal_transformer_blocks.0.norm2.bias down_blocks.2.attentions.0.temporal_transformer_blocks.0.attn2.to_q.weight down_blocks.2.attentions.0.temporal_transformer_blocks.0.attn2.to_k.weight down_blocks.2.attentions.0.temporal_transformer_blocks.0.attn2.to_v.weight down_blocks.2.attentions.0.temporal_transformer_blocks.0.attn2.to_out.0.weight down_blocks.2.attentions.0.temporal_transformer_blocks.0.attn2.to_out.0.bias down_blocks.2.attentions.0.temporal_transformer_blocks.0.cc_projection.weight down_blocks.2.attentions.0.temporal_transformer_blocks.0.cc_projection.bias down_blocks.2.attentions.1.temporal_transformer_blocks.0.norm2.weight down_blocks.2.attentions.1.temporal_transformer_blocks.0.norm2.bias
down_blocks.2.attentions.1.temporal_transformer_blocks.0.attn2.to_q.weight down_blocks.2.attentions.1.temporal_transformer_blocks.0.attn2.to_k.weight down_blocks.2.attentions.1.temporal_transformer_blocks.0.attn2.to_v.weight down_blocks.2.attentions.1.temporal_transformer_blocks.0.attn2.to_out.0.weight down_blocks.2.attentions.1.temporal_transformer_blocks.0.attn2.to_out.0.bias down_blocks.2.attentions.1.temporal_transformer_blocks.0.cc_projection.weight down_blocks.2.attentions.1.temporal_transformer_blocks.0.cc_projection.bias up_blocks.1.attentions.0.temporal_transformer_blocks.0.norm2.weight up_blocks.1.attentions.0.temporal_transformer_blocks.0.norm2.bias up_blocks.1.attentions.0.temporal_transformer_blocks.0.attn2.to_q.weight up_blocks.1.attentions.0.temporal_transformer_blocks.0.attn2.to_k.weight up_blocks.1.attentions.0.temporal_transformer_blocks.0.attn2.to_v.weight up_blocks.1.attentions.0.temporal_transformer_blocks.0.attn2.to_out.0.weight up_blocks.1.attentions.0.temporal_transformer_blocks.0.attn2.to_out.0.bias up_blocks.1.attentions.0.temporal_transformer_blocks.0.cc_projection.weight up_blocks.1.attentions.0.temporal_transformer_blocks.0.cc_projection.bias up_blocks.1.attentions.1.temporal_transformer_blocks.0.norm2.weight up_blocks.1.attentions.1.temporal_transformer_blocks.0.norm2.bias up_blocks.1.attentions.1.temporal_transformer_blocks.0.attn2.to_q.weight up_blocks.1.attentions.1.temporal_transformer_blocks.0.attn2.to_k.weight up_blocks.1.attentions.1.temporal_transformer_blocks.0.attn2.to_v.weight up_blocks.1.attentions.1.temporal_transformer_blocks.0.attn2.to_out.0.weight up_blocks.1.attentions.1.temporal_transformer_blocks.0.attn2.to_out.0.bias up_blocks.1.attentions.1.temporal_transformer_blocks.0.cc_projection.weight up_blocks.1.attentions.1.temporal_transformer_blocks.0.cc_projection.bias up_blocks.1.attentions.2.temporal_transformer_blocks.0.norm2.weight up_blocks.1.attentions.2.temporal_transformer_blocks.0.norm2.bias up_blocks.1.attentions.2.temporal_transformer_blocks.0.attn2.to_q.weight up_blocks.1.attentions.2.temporal_transformer_blocks.0.attn2.to_k.weight up_blocks.1.attentions.2.temporal_transformer_blocks.0.attn2.to_v.weight up_blocks.1.attentions.2.temporal_transformer_blocks.0.attn2.to_out.0.weight up_blocks.1.attentions.2.temporal_transformer_blocks.0.attn2.to_out.0.bias up_blocks.1.attentions.2.temporal_transformer_blocks.0.cc_projection.weight up_blocks.1.attentions.2.temporal_transformer_blocks.0.cc_projection.bias up_blocks.2.attentions.0.temporal_transformer_blocks.0.norm2.weight up_blocks.2.attentions.0.temporal_transformer_blocks.0.norm2.bias up_blocks.2.attentions.0.temporal_transformer_blocks.0.attn2.to_q.weight up_blocks.2.attentions.0.temporal_transformer_blocks.0.attn2.to_k.weight up_blocks.2.attentions.0.temporal_transformer_blocks.0.attn2.to_v.weight up_blocks.2.attentions.0.temporal_transformer_blocks.0.attn2.to_out.0.weight up_blocks.2.attentions.0.temporal_transformer_blocks.0.attn2.to_out.0.bias up_blocks.2.attentions.0.temporal_transformer_blocks.0.cc_projection.weight up_blocks.2.attentions.0.temporal_transformer_blocks.0.cc_projection.bias up_blocks.2.attentions.1.temporal_transformer_blocks.0.norm2.weight up_blocks.2.attentions.1.temporal_transformer_blocks.0.norm2.bias up_blocks.2.attentions.1.temporal_transformer_blocks.0.attn2.to_q.weight up_blocks.2.attentions.1.temporal_transformer_blocks.0.attn2.to_k.weight up_blocks.2.attentions.1.temporal_transformer_blocks.0.attn2.to_v.weight 
up_blocks.2.attentions.1.temporal_transformer_blocks.0.attn2.to_out.0.weight up_blocks.2.attentions.1.temporal_transformer_blocks.0.attn2.to_out.0.bias up_blocks.2.attentions.1.temporal_transformer_blocks.0.cc_projection.weight up_blocks.2.attentions.1.temporal_transformer_blocks.0.cc_projection.bias up_blocks.2.attentions.2.temporal_transformer_blocks.0.norm2.weight up_blocks.2.attentions.2.temporal_transformer_blocks.0.norm2.bias up_blocks.2.attentions.2.temporal_transformer_blocks.0.attn2.to_q.weight up_blocks.2.attentions.2.temporal_transformer_blocks.0.attn2.to_k.weight up_blocks.2.attentions.2.temporal_transformer_blocks.0.attn2.to_v.weight up_blocks.2.attentions.2.temporal_transformer_blocks.0.attn2.to_out.0.weight up_blocks.2.attentions.2.temporal_transformer_blocks.0.attn2.to_out.0.bias up_blocks.2.attentions.2.temporal_transformer_blocks.0.cc_projection.weight up_blocks.2.attentions.2.temporal_transformer_blocks.0.cc_projection.bias up_blocks.3.attentions.0.temporal_transformer_blocks.0.norm2.weight up_blocks.3.attentions.0.temporal_transformer_blocks.0.norm2.bias up_blocks.3.attentions.0.temporal_transformer_blocks.0.attn2.to_q.weight up_blocks.3.attentions.0.temporal_transformer_blocks.0.attn2.to_k.weight up_blocks.3.attentions.0.temporal_transformer_blocks.0.attn2.to_v.weight up_blocks.3.attentions.0.temporal_transformer_blocks.0.attn2.to_out.0.weight up_blocks.3.attentions.0.temporal_transformer_blocks.0.attn2.to_out.0.bias up_blocks.3.attentions.0.temporal_transformer_blocks.0.cc_projection.weight up_blocks.3.attentions.0.temporal_transformer_blocks.0.cc_projection.bias up_blocks.3.attentions.1.temporal_transformer_blocks.0.norm2.weight up_blocks.3.attentions.1.temporal_transformer_blocks.0.norm2.bias up_blocks.3.attentions.1.temporal_transformer_blocks.0.attn2.to_q.weight up_blocks.3.attentions.1.temporal_transformer_blocks.0.attn2.to_k.weight up_blocks.3.attentions.1.temporal_transformer_blocks.0.attn2.to_v.weight up_blocks.3.attentions.1.temporal_transformer_blocks.0.attn2.to_out.0.weight up_blocks.3.attentions.1.temporal_transformer_blocks.0.attn2.to_out.0.bias up_blocks.3.attentions.1.temporal_transformer_blocks.0.cc_projection.weight up_blocks.3.attentions.1.temporal_transformer_blocks.0.cc_projection.bias up_blocks.3.attentions.2.temporal_transformer_blocks.0.norm2.weight up_blocks.3.attentions.2.temporal_transformer_blocks.0.norm2.bias up_blocks.3.attentions.2.temporal_transformer_blocks.0.attn2.to_q.weight up_blocks.3.attentions.2.temporal_transformer_blocks.0.attn2.to_k.weight up_blocks.3.attentions.2.temporal_transformer_blocks.0.attn2.to_v.weight up_blocks.3.attentions.2.temporal_transformer_blocks.0.attn2.to_out.0.weight up_blocks.3.attentions.2.temporal_transformer_blocks.0.attn2.to_out.0.bias up_blocks.3.attentions.2.temporal_transformer_blocks.0.cc_projection.weight up_blocks.3.attentions.2.temporal_transformer_blocks.0.cc_projection.bias mid_block.attentions.0.temporal_transformer_blocks.0.norm2.weight mid_block.attentions.0.temporal_transformer_blocks.0.norm2.bias mid_block.attentions.0.temporal_transformer_blocks.0.attn2.to_q.weight mid_block.attentions.0.temporal_transformer_blocks.0.attn2.to_k.weight mid_block.attentions.0.temporal_transformer_blocks.0.attn2.to_v.weight mid_block.attentions.0.temporal_transformer_blocks.0.attn2.to_out.0.weight mid_block.attentions.0.temporal_transformer_blocks.0.attn2.to_out.0.bias mid_block.attentions.0.temporal_transformer_blocks.0.cc_projection.weight 
mid_block.attentions.0.temporal_transformer_blocks.0.cc_projection.bias

a-r-r-o-w commented 6 months ago

Yes, that's correct. The attn2 and cc_projection layers are the trainable layers.
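As a sanity check, one could freeze the UNet and re-enable gradients only for those layers by name; a minimal sketch, assuming `unet` is a UNet variant that already contains the extra cc_projection modules (e.g. from the earlier conversion sketch):

```python
# Freeze everything, then re-enable gradients only for the layers confirmed above.
trainable_substrings = ("attn2", "norm2", "cc_projection")

unet.requires_grad_(False)
for name, param in unet.named_parameters():
    if "temporal_transformer_blocks" in name and any(s in name for s in trainable_substrings):
        param.requires_grad = True

trainable = [name for name, param in unet.named_parameters() if param.requires_grad]
print(f"{len(trainable)} trainable parameter tensors")
```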

a-r-r-o-w commented 1 month ago

Closing due to this.