
TI2V-Zero (CVPR 2024)

This repository contains the implementation of the paper:

TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models
Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X. Huang, Tim K. Marks

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

[Project Page]

Summary

Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water"). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this work, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introduction of external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize the Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks such as video infilling and prediction when provided with more images. Its autoregressive design also supports long video generation.
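
To make the "repeat-and-slide" idea above concrete, here is a minimal Python sketch of the autoregressive loop. It is not the authors' implementation: the two model calls are placeholders for the frozen pretrained T2V model and the DDPM forward process, the DDPM-inversion noise initialization and the resampling step are omitted, and the tensor shapes and default values (window size, number of denoising steps) are illustrative assumptions only.

    import torch

    def t2v_denoise_step(window, t, prompt):
        # Placeholder for one reverse-diffusion step of the frozen,
        # pretrained T2V model; a dummy update is used here.
        return window - 0.01 * torch.randn_like(window)

    def ddpm_add_noise(frames, t):
        # Placeholder for the DDPM forward process used to noise the known
        # frames to the current timestep before re-injecting them.
        return frames + 0.1 * torch.randn_like(frames)

    def generate_video(first_frame, prompt, num_new_frames=15,
                       window_size=8, num_steps=50):
        # "Repeat": fill the temporal window with copies of the given image.
        known = [first_frame.clone() for _ in range(window_size)]
        video = [first_frame.clone()]
        for _ in range(num_new_frames):
            # The last slot of the window is the frame to be synthesized.
            # (The paper initializes this noise via DDPM inversion; plain
            # Gaussian noise is used here for brevity.)
            window = torch.stack(known + [torch.randn_like(first_frame)])
            for t in reversed(range(num_steps)):
                # Re-inject noised versions of the known frames at every step,
                # so the frozen model only has to synthesize the final slot.
                window[:window_size] = ddpm_add_noise(torch.stack(known), t)
                window = t2v_denoise_step(window, t, prompt)
            new_frame = window[-1]
            video.append(new_frame)
            # "Slide": drop the oldest frame and append the new one.
            known = known[1:] + [new_frame]
        return torch.stack(video)

    # Toy usage with a random "image"; shapes are illustrative only.
    frames = generate_video(torch.randn(3, 64, 64), "a woman is drinking water")
    print(frames.shape)  # torch.Size([16, 3, 64, 64])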

Quick Start


  1. Install the required dependencies. First create a conda environment with conda create --name ti2v python=3.8 and activate it with conda activate ti2v. Then run pip install -r requirements.txt to install the remaining dependencies.
  2. Run python initialization.py to download the pretrained ModelScope models from Hugging Face.
  3. Run python demo_img2vid.py to generate videos from a provided image and text input.

You can set the image path and text input in this file manually. By default, the script uses example images and text inputs. The example images in the examples/ folder were generated using Stable Diffusion.
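
As an illustration of such a manual edit, the settings might look something like the following; the actual variable names used inside demo_img2vid.py may differ, so treat this only as a sketch.

    # Hypothetical settings near the top of demo_img2vid.py; the real
    # variable names in the script may differ.
    image_path = "examples/your_image.png"        # conditioning image
    text_prompt = "a woman is drinking water"     # desired motion/content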

Generating Videos using Public Datasets


MUG Dataset

  1. Download the MUG dataset from its website.
  2. After installing dependencies, run python gen_video_mug.py to generate videos. Please set the paths in the code files if needed.

UCF101 Dataset

  1. Download the UCF101 dataset from its website.
  2. Preprocess the dataset by sampling frames from the videos. You may use our preprocessing function in datasets_ucf.py; a generic frame-sampling sketch is shown after this list.
  3. After installing dependencies, run python gen_video_ucf.py to generate videos. Please set the paths in the code files if needed.
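
The following is a generic illustration of the preprocessing in step 2. It is not the function provided in datasets_ucf.py, and it assumes opencv-python is installed; paths in the usage comment are hypothetical.

    import os
    import cv2  # assumes opencv-python is installed

    def sample_frames(video_path, out_dir, stride=4):
        # Save every `stride`-th frame of a video as a PNG image.
        os.makedirs(out_dir, exist_ok=True)
        cap = cv2.VideoCapture(video_path)
        index = saved = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % stride == 0:
                cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.png"), frame)
                saved += 1
            index += 1
        cap.release()
        return saved

    # Example with hypothetical paths:
    # sample_frames("UCF-101/ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi",
    #               "ucf_frames/v_ApplyEyeMakeup_g01_c01")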

Contributing

See CONTRIBUTING.md for our policy on contributions.

License

Released under the AGPL-3.0-or-later license, as found in the LICENSE.md file.

All files, except as noted below:

Copyright (c) 2024 Mitsubishi Electric Research Laboratories (MERL)
SPDX-License-Identifier: AGPL-3.0-or-later

The following files

were adapted from https://github.com/modelscope/modelscope/tree/57791a8cc59ccf9eda8b94a9a9512d9e3029c00b/modelscope/models/multi_modal/video_synthesis (license included in LICENSES/Apache-2.0.txt):

Copyright (c) 2024 Mitsubishi Electric Research Laboratories (MERL)
Copyright (c) 2021-2022 The Alibaba Fundamental Vision Team Authors

The following file

was adapted from https://github.com/modelscope/modelscope/blob/bedec553c17b7e297da9db466fee61ccbd4295ba/modelscope/pipelines/multi_modal/text_to_video_synthesis_pipeline.py (license included in LICENSES/Apache-2.0.txt):

Copyright (c) 2024 Mitsubishi Electric Research Laboratories (MERL)
Copyright (c) Alibaba, Inc. and its affiliates.

The following file

was adapted from https://github.com/modelscope/modelscope/blob/57791a8cc59ccf9eda8b94a9a9512d9e3029c00b/modelscope/models/cv/anydoor/ldm/util.py (license included in LICENSES/Apache-2.0.txt):

Copyright (c) 2024 Mitsubishi Electric Research Laboratories (MERL)
Copyright (c) 2021-2022 The Alibaba Fundamental Vision Team Authors. All rights reserved.

The following files:

were adapted from LFDM (license included in LICENSES/BSD-2-Clause.txt):

Copyright (c) 2024 Mitsubishi Electric Research Laboratories (MERL)
Copyright (C) 2023 NEC Laboratories America, Inc. ("NECLA"). All rights reserved.

Citation

If you use our work, please use the following citation:

@inproceedings{ni2024ti2v,
  title={TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models},
  author={Ni, Haomiao and Egger, Bernhard and Lohit, Suhas and Cherian, Anoop and Wang, Ye and Koike-Akino, Toshiaki and Huang, Sharon X and Marks, Tim K},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}