From an implementation point of view, the authors mention VideoCrafter, Stable Video Diffusion and AnimateDiff on their official project page as the currently supported techniques. In diffusers, the main change appears to be in the temporal transformer blocks, where a linear projection layer (with in_channels=num_latent_channels + 12 and out_channels=num_latent_channels) is added. It does not look like official OMCM checkpoints exist for SVD, but maybe the VideoCrafter one generalizes (I'm not sure, because they do not provide an example either), so we don't have to worry about that for now. The outputs of the linear projection layers are simply added to the outputs of attn2 before further processing. These changes seem minimal. We'd also need a conversion script to extract the SVD and CMCM layers from the provided checkpoint.
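To make this concrete, here is a minimal sketch of what such a modified temporal transformer block could look like. It uses nn.MultiheadAttention as a stand-in for the diffusers Attention class, and the additive fusion with attn2 follows the description above; the exact wiring is an assumption until checked against the official code.

```python
import torch
import torch.nn as nn

class CameraConditionedTemporalBlock(nn.Module):
    """Sketch of a temporal transformer block with the CMCM change: camera
    poses (12 values per frame, a flattened 3x4 [R|t] matrix) go through a
    linear projection whose output is added to the attn2 output."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm2 = nn.LayerNorm(dim)
        # Stand-in for the diffusers Attention class used for attn2.
        self.attn2 = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # in_channels = num_latent_channels + 12, out_channels = num_latent_channels
        self.cc_projection = nn.Linear(dim + 12, dim)

    def forward(self, hidden_states: torch.Tensor, camera_pose: torch.Tensor):
        # hidden_states: (batch * height * width, num_frames, dim)
        # camera_pose:   (batch * height * width, num_frames, 12)
        norm_hidden = self.norm2(hidden_states)
        attn_output, _ = self.attn2(norm_hidden, norm_hidden, norm_hidden)
        # Project the pose-augmented states and add them to the attn2 output
        # before further processing (additive fusion, as described above).
        pose_states = self.cc_projection(torch.cat([norm_hidden, camera_pose], dim=-1))
        return hidden_states + attn_output + pose_states
```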
Do let me know your thoughts and whether I missed something. I believe most of the sampling code remains the same or will only require minimal changes for SVD. If the pipeline is to be generally usable, the 3x4 matrix notation for camera orientation/translation might be difficult to understand. Maybe we can rethink how it is provided as input and handle the conversion to the expected format internally (see the sketch below). That said, I'd be happy to take a stab at this along with DragNUWA, and am open to any help from other interested contributors :)
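For the input-format question, a sketch of a possible user-facing helper; the function name and the row-major flattening order are assumptions, and whatever convention the official checkpoint was trained with would have to be matched.

```python
import numpy as np

def poses_to_motionctrl_input(rotations: np.ndarray, translations: np.ndarray) -> np.ndarray:
    """Hypothetical helper: turn per-frame camera extrinsics into the
    flattened 12-value-per-frame representation the model expects.

    rotations:    (num_frames, 3, 3) rotation matrices
    translations: (num_frames, 3) translation vectors
    returns:      (num_frames, 12) flattened 3x4 [R|t] matrices
    """
    rt = np.concatenate([rotations, translations[:, :, None]], axis=-1)  # (F, 3, 4)
    # Row-major flattening is an assumption; it must match the training convention.
    return rt.reshape(rotations.shape[0], 12)
```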
The provided checkpoint contains the SVD weights plus the MotionCtrl-specific parameters (the only trainable layers were the CMCM layers, i.e. a few linear projection layers for each block, plus attn2 and norm2). I've been having trouble converting the checkpoint to the diffusers format but haven't had time to fix the conversion script yet. Knowing the trainable layers lets us reuse the existing SVD model and just replace the weights of those layers.
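A minimal sketch of that idea, assuming the MotionCtrl checkpoint has already been converted to diffusers-style key names (the file name below is a placeholder):

```python
import torch
from diffusers import UNetSpatioTemporalConditionModel

# Start from the stock SVD UNet shipped with diffusers.
unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid", subfolder="unet"
)

# Placeholder path: a MotionCtrl checkpoint already converted to diffusers key names.
motionctrl_sd = torch.load("motionctrl_svd_converted.pt", map_location="cpu")

# Keep only the layers MotionCtrl trained (attn2, norm2, cc_projection).
trained = {
    k: v
    for k, v in motionctrl_sd.items()
    if any(tag in k for tag in ("attn2", "norm2", "cc_projection"))
}

# strict=False tolerates the cc_projection keys, which the stock UNet does not
# have; those layers would need to be added to the temporal transformer blocks
# before their weights can actually be loaded.
missing, unexpected = unet.load_state_dict(trained, strict=False)
```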
Similar work, but for AnimateDiff: https://github.com/hehao13/CameraCtrl/. They also plan to release an SVD version trained on RealEstate10K. It uses a camera trajectory encoder instead of the extra 12 channels in MotionCtrl's SVD variant. They already have a diffusers pipeline and related modeling code, so if there is interest in this for AnimateDiff, I'm happy to help integrate it. cc @DN6 @sayakpaul @yiyixuxu
Project page: https://hehao13.github.io/projects-CameraCtrl/
Paper: https://arxiv.org/abs/2404.02101
@a-r-r-o-w Can you share the list of trainable layers?
@a-r-r-o-w Are these the right trainable parameters? For each temporal transformer block, i.e. for every prefix

down_blocks.{0,1,2}.attentions.{0,1}.temporal_transformer_blocks.0
up_blocks.{1,2,3}.attentions.{0,1,2}.temporal_transformer_blocks.0
mid_block.attentions.0.temporal_transformer_blocks.0

the checkpoint lists the same nine parameters:

norm2.weight, norm2.bias
attn2.to_q.weight, attn2.to_k.weight, attn2.to_v.weight
attn2.to_out.0.weight, attn2.to_out.0.bias
cc_projection.weight, cc_projection.bias
Yes, that's correct. The attn2, norm2 and cc_projection layers are the trainable ones.
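For anyone wanting to cross-check this programmatically, a small sketch that freezes everything except parameters matching the pattern above (note the stock SVD UNet has no cc_projection layers until they are added):

```python
from diffusers import UNetSpatioTemporalConditionModel

def is_motionctrl_trainable(name: str) -> bool:
    # Matches the listing above: attn2, norm2 and cc_projection parameters
    # inside the temporal transformer blocks.
    return "temporal_transformer_blocks" in name and any(
        tag in name for tag in ("attn2", "norm2", "cc_projection")
    )

unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid", subfolder="unet"
)
for name, param in unet.named_parameters():
    param.requires_grad = is_motionctrl_trainable(name)

print(sum(p.numel() for p in unet.parameters() if p.requires_grad), "trainable parameters")
```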
Model/Pipeline/Scheduler description
The MotionCtrl model seems like a promising way to control movement in video generation models. According to the project page, it can work with both SVD and AnimateDiff.
It would be nice to get an idea of how we might be able to incorporate this model into diffusers.
GitHub: https://github.com/TencentARC/MotionCtrl
HF checkpoint: https://huggingface.co/TencentARC/MotionCtrl
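To make the request concrete, a rough sketch of what a user-facing integration could look like. MotionCtrlSVDPipeline and the camera_poses argument are hypothetical names, not an existing diffusers API:

```python
import torch
from diffusers.utils import load_image

# Hypothetical: MotionCtrlSVDPipeline does not exist in diffusers yet, and the
# camera_poses argument is an assumed interface (one flattened 3x4 [R|t] per frame).
pipe = MotionCtrlSVDPipeline.from_pretrained(
    "TencentARC/MotionCtrl", torch_dtype=torch.float16
).to("cuda")

image = load_image("input.png")  # placeholder conditioning image
camera_poses = torch.zeros(14, 12)  # 14 frames x 12 pose values per frame (placeholder)
frames = pipe(image, camera_poses=camera_poses, num_frames=14).frames[0]
```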
Open source status
Provide useful links for the implementation
No response