"To bypass these challenges, our key idea is to utilize the abundance of existing video clips and synthesize a coherent storytelling video by customizing their appearances."

Introduction

This paper proposes a method to generate videos with customized appearance by leveraging the motion and structure of existing videos. Specifically, the authors first use a Motion Structure Retrieval module to query candidate videos according to the input prompt and extract their motion. Then, they use Structure-Guided Text-to-Video Synthesis to generate videos under the guidance of the text input and the motion extracted in the previous step. They also introduce TimeInv, a video version of textual inversion, to maintain character consistency.
Method
Video generation
1. Extract plots (a single event without shot transitions) from the storyboard and write the corresponding text description and prompt, either manually or with the assistance of LLMs.
2. Use an off-the-shelf text-based video retrieval engine[^1] to obtain video candidates. Note that the retrieved video's content does not have to match the target scene exactly; for example, "a boy plays with a dog in the park" can be retrieved for "an elf plays with a butterfly in the forest".
3. Extract the motion structure with a depth estimation network[^2]; a rough sketch of this pipeline follows the list.
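A minimal sketch of this retrieval-and-depth pipeline. The `retrieve_videos` and `decode_frames` helpers are hypothetical stand-ins (the paper uses the Frozen-in-Time retrieval engine), and the depth network is loaded via the public MiDaS torch hub entry point, which assumes network access.

```python
import torch

# Hypothetical stand-in for the text-based video retrieval engine;
# returns paths of candidate clips matching a plot prompt.
def retrieve_videos(prompt: str, top_k: int = 3) -> list[str]:
    raise NotImplementedError("query your retrieval index here")

# Depth estimator from the public MiDaS torch hub entry point.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
midas.eval()

@torch.no_grad()
def extract_motion_structure(frames):
    """frames: list of HxWx3 RGB uint8 arrays -> list of per-frame depth maps."""
    depths = []
    for frame in frames:
        batch = transform(frame)      # (1, 3, H', W') tensor
        depth = midas(batch)          # (1, H', W') inverse-depth prediction
        depths.append(depth.squeeze(0))
    return depths

# Usage (once a retrieval index and a frame decoder exist):
#   clips = retrieve_videos("a boy plays with a dog in the park")
#   depth_seq = extract_motion_structure(decode_frames(clips[0]))
```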
Video Character Rendering (TimeInv)
TimeInv is based on the observation that different timesteps control the rendering of different image attributes during the inference stage. For example, the earlier timesteps of the denoising process control the global layout and object shape, while the later timesteps control low-level details such as texture and color [Voynov et al. 2022][^3].
TimeInv is the timestep-dependent version of textual inversion, which applies different embeddings at different timesteps of the diffusion process. Specifically, the authors optimize a textual inversion table that records a separate embedding for each timestep.
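A minimal sketch of how such a timestep-dependent embedding table could look, assuming one learnable pseudo-token row per diffusion timestep; class and dimension names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class TimestepTokenTable(nn.Module):
    """One learnable pseudo-token embedding per diffusion timestep."""

    def __init__(self, num_timesteps: int = 1000, embed_dim: int = 768):
        super().__init__()
        # Standard textual inversion learns a single (1, embed_dim) vector;
        # TimeInv instead keeps one row per timestep.
        self.table = nn.Parameter(torch.randn(num_timesteps, embed_dim) * 0.01)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch,) integer timesteps -> (batch, embed_dim) token embeddings
        return self.table[t]

# A placeholder token (e.g. "<char>") in the prompt would be replaced by
# table(t) before text encoding, so early (layout) and late (texture) steps
# can use different embeddings for the same character.
table = TimestepTokenTable()
emb = table(torch.tensor([999, 500, 10]))   # shape: (3, 768)
```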
Similar to LoRA, they also add low-rank modules to the attention layers for optimization.
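For reference, a hedged sketch of a LoRA-style low-rank adapter wrapped around an attention projection; the rank, scaling, and placement here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear projection plus a trainable low-rank residual (W + B @ A)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # keep the original weights frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# e.g. wrap a key/value projection of a cross-attention block:
proj = nn.Linear(768, 768)
proj_lora = LoRALinear(proj, rank=4)
```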
To resolve the appearance conflict between the guidance video and the customized character, the authors apply the structure guidance only during the first $\tau$ steps of the denoising process. A figure in the paper illustrates the effect of different $\tau$ values.
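A rough sketch of the idea, assuming a diffusers-style sampling loop: depth guidance is passed to the denoiser only for the first $\tau$ steps, so the early steps inherit layout and motion from the guidance video while the later steps render the customized appearance freely. The `unet` and `scheduler` interfaces are placeholders, not the paper's actual implementation.

```python
import torch

@torch.no_grad()
def sample_with_partial_guidance(unet, scheduler, text_emb, depth_cond,
                                 latents, tau: int):
    """Apply structure (depth) conditioning only for the first `tau` steps.

    `unet(latents, t, text_emb, depth)` and `scheduler` are placeholders for a
    depth-conditioned video diffusion model and a diffusers-style scheduler.
    """
    for i, t in enumerate(scheduler.timesteps):    # high noise -> low noise
        cond = depth_cond if i < tau else None     # drop guidance after tau steps
        noise_pred = unet(latents, t, text_emb, cond)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```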
Highlights
Enables the generation of relatively long videos.
Both the subject motion and the camera motion (zoom in, zoom out) are controllable.
The paper proposes a complete system rather than just a machine learning algorithm.
TimeInv also works for static images and can replace textual inversion.
Limitations
Since the motion is guided by existing videos, the generated videos may raise licensing issues.
Relatedly, the approach requires a large video database as well as efficient retrieval and motion-extraction algorithms.
Comments
Great paper. There is more information in the evaluation section. Since there is only a limited amount of research on customizing video diffusion models, they compare their method against methods designed for customizing static images. Unsurprisingly, those methods tend to produce nearly static scenes because they overemphasize customization. In "Video Customization with Image Data", the authors note that they address this by "repeating the concept image to create a pseudo video with $L$ frames and extracting frame-by-frame depth signals to control the video generation model for synthesizing static concept videos." However, I am having difficulty fully grasping the idea.
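For what it's worth, here is one possible reading of that sentence as a sketch: the single concept image is repeated $L$ times along the time axis, so its frame-by-frame depth is constant and the structure guidance describes a static scene. Shapes and names are illustrative.

```python
import numpy as np

def make_pseudo_video(concept_image: np.ndarray, num_frames: int) -> np.ndarray:
    """Repeat one HxWx3 concept image -> a static (L, H, W, 3) pseudo video."""
    return np.repeat(concept_image[None, ...], num_frames, axis=0)

# The per-frame depth of this pseudo video is identical across frames, so the
# depth-conditioned generator is steered toward a static concept scene.
```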
[^1]: Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. [Bain et al. 2021]
[^2]: [MiDaS] Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer. [Ranftl et al. 2022]
[^3]: Sketch-Guided Text-to-Image Diffusion Models. [Voynov et al. 2022]