MotionMaster: Training-free Camera Motion Transfer For Video Generation

Model/Pipeline/Scheduler description

Most existing camera motion control methods for video generation with denoising diffusion models rely on training a temporal camera module, and they require substantial computational resources because of the large number of parameters in video generation models.

The authors propose MotionMaster, a novel training-free video motion transfer model that first disentangles camera motion and object motion in the temporal attention maps extracted during DDIM inversion of the source video(s), and then transfers the extracted camera motion to new videos. Two disentanglement methods are proposed:

A one-shot camera motion disentanglement method given a single source video, which cuts out the temporal attention map of the foreground region to disentangle foreground object motion, and then estimates the camera motion component of the temporal attention map in the foreground region by solving a Poisson equation to satisfy smoothness and boundary constraints.
A few-shot camera motion disentanglement method to extract common camera motion from multiple videos, which employs a window-based clustering technique for each spatial token to extract common features from temporal attention maps of multiple videos.
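To make the one-shot idea more concrete, here is a minimal sketch of filling a masked foreground region from its surroundings under a smoothness constraint. It is not the authors' implementation: it assumes a single 2D channel of values, a zero source term (so the Poisson equation reduces to the Laplace equation), and a plain Gauss-Seidel solver. The names `fill_masked_region`, `values`, and `mask` are hypothetical.

```python
def fill_masked_region(values, mask, iterations=500):
    """Fill masked cells so each equals the mean of its 4-neighbours.

    Assumption: zero source term, i.e. the discrete Laplace equation,
    with the unmasked cells acting as fixed boundary values.
    """
    height, width = len(values), len(values[0])
    grid = [row[:] for row in values]  # copy so the input is not modified

    # Discard whatever was inside the mask (the "foreground" values).
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                grid[y][x] = 0.0

    # Gauss-Seidel sweeps: update masked cells in place until they settle.
    for _ in range(iterations):
        for y in range(height):
            for x in range(width):
                if not mask[y][x]:
                    continue
                neighbours = [
                    grid[ny][nx]
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                    if 0 <= ny < height and 0 <= nx < width
                ]
                grid[y][x] = sum(neighbours) / len(neighbours)
    return grid


# Toy example: a horizontal ramp whose 3x3 centre has been masked out.
values = [[float(x) for x in range(5)] for _ in range(5)]
mask = [[1 <= y <= 3 and 1 <= x <= 3 for x in range(5)] for y in range(5)]
filled = fill_masked_region(values, mask)
```

In this toy case the surrounding values form a linear ramp, so the smooth fill recovers the ramp inside the mask. In the actual method the quantity being completed is the camera motion component of the temporal attention map, which has more dimensions than this single-channel sketch.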

Finally, the authors demonstrate the linearity and spatial-token decomposability of the latent space of camera motion features formed by the extracted temporal attention maps, enabling further flexibility in combining and altering camera motion features before injection into target videos.
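As a rough illustration of what linearity and spatial-token decomposability allow, the sketch below combines two sets of per-token camera motion features. It is an assumption-laden toy: features are modelled as flat lists with one float per spatial token, and `blend_motion` and `compose_by_region` are hypothetical helpers, not part of MotionMaster or diffusers.

```python
def blend_motion(features_a, features_b, weight_a=1.0, weight_b=1.0):
    """Linear combination of two camera motion features, token by token."""
    if len(features_a) != len(features_b):
        raise ValueError("feature lists must have the same number of tokens")
    return [weight_a * a + weight_b * b for a, b in zip(features_a, features_b)]


def compose_by_region(features_a, features_b, use_a):
    """Take motion A for tokens where use_a is True, motion B elsewhere."""
    if not (len(features_a) == len(features_b) == len(use_a)):
        raise ValueError("all inputs must have the same number of tokens")
    return [a if pick else b for a, b, pick in zip(features_a, features_b, use_a)]


# Hypothetical per-token features for two camera motions.
zoom = [1.0, 2.0, 3.0]
pan = [0.5, 0.5, 0.5]

zoom_plus_pan = blend_motion(zoom, pan)  # combine both motions
zoom_left_pan_right = compose_by_region(zoom, pan, [True, True, False])
```

The first helper corresponds to combining or scaling motions, the second to assigning different motions to different spatial regions before injection.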

Open source status

[X] The model implementation is available.
[X] The model weights are available (Only relevant if addition is not a scheduler).

Provide useful links for the implementation

GitHub: https://github.com/sjtuplayer/MotionMaster
Paper: https://arxiv.org/pdf/2404.15789
Project Website: https://sjtuplayer.github.io/projects/MotionMaster/
Main author: @sjtuplayer
