This repository contains the implementation of the paper:
TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models
Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X. Huang, Tim K. Marks
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water"). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this work, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introduction of external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame by frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize the Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks such as video infilling and prediction when provided with more images. Its autoregressive design also supports long video generation.
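To make the repeat-and-slide idea concrete, here is a minimal, hypothetical Python sketch, not the code in this repository: `toy_denoiser` stands in for the frozen T2V model, the one-line update rule is a placeholder rather than a real DDPM step, and the plain Gaussian init stands in for the paper's DDPM-inversion-based initialization and resampling. Only the window bookkeeping reflects the actual strategy.

```python
import torch

def toy_denoiser(window, t):
    # Stand-in for the frozen T2V diffusion model's noise prediction;
    # returns a tensor of the same shape as the latent window.
    return torch.randn_like(window)

def repeat_and_slide(cond_latent, num_new_frames, window_len=8, steps=50):
    # "Repeat": fill the temporal window by repeating the conditioning
    # frame's latent in all but the last slot.
    known = [cond_latent.clone() for _ in range(window_len - 1)]
    generated = []
    for _ in range(num_new_frames):
        # Plain Gaussian init for the new frame (the paper instead uses a
        # DDPM-inversion-based init plus resampling for temporal continuity).
        new = torch.randn_like(cond_latent)
        for t in reversed(range(steps)):
            window = torch.stack(known + [new], dim=0)  # (window_len, C, H, W)
            eps = toy_denoiser(window, t)
            # Placeholder update: only the last slot is denoised; the known
            # slots are re-clamped at every step, which is what steers the
            # frozen model toward the conditioning frame.
            new = new - 0.01 * eps[-1]
        generated.append(new)
        # "Slide": drop the oldest slot and append the newly finished frame.
        known = known[1:] + [new]
    return generated

frames = repeat_and_slide(torch.randn(4, 32, 32), num_new_frames=3)
print(len(frames), frames[0].shape)  # 3 torch.Size([4, 32, 32])
```

Because each finished frame simply joins the window for the next iteration, the same loop runs autoregressively for arbitrarily long videos, which is the long-generation property noted above.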
Create a conda environment:
conda create --name ti2v python=3.8
Activate the conda environment using conda activate ti2v, then use pip install -r requirements.txt to install the remaining dependencies.
Run python initialization.py to download the pretrained ModelScope models from HuggingFace.
Run python demo_img2vid.py to generate videos by providing an image and a text input. You can set the image path and text input in this file manually; by default, the file uses example images and text inputs. The example images in the examples/ folder were generated using Stable Diffusion.
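For reference, that manual edit typically amounts to two assignments like the following; the actual variable names and paths inside demo_img2vid.py may differ, so treat these purely as placeholders:

```python
# Hypothetical placeholders; check demo_img2vid.py for the real variable names.
image_path = "examples/example.png"        # conditioning image for the first frame
text_prompt = "a woman is drinking water"  # text description guiding the video
```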
MUG Dataset
Run python gen_video_mug.py to generate videos. Please set the paths in the code files if needed.
UCF101 Dataset
Prepare the dataset following datasets_ucf.py, then run python gen_video_ucf.py to generate videos. Please set the paths in the code files if needed.
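As a hypothetical illustration of the kind of paths these scripts expect (the real variable names live inside gen_video_mug.py and gen_video_ucf.py and may differ):

```python
# Placeholder paths; edit the corresponding variables in gen_video_mug.py /
# gen_video_ucf.py rather than copying these names verbatim.
data_dir = "/path/to/dataset"    # root directory of the MUG or UCF101 data
save_dir = "./outputs"           # directory where generated videos are saved
```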
See CONTRIBUTING.md for our policy on contributions.
Released under the AGPL-3.0-or-later license, as found in the LICENSE.md file.
All files, except as noted below:
Copyright (c) 2024 Mitsubishi Electric Research Laboratories (MERL)
SPDX-License-Identifier: AGPL-3.0-or-later
The following files:
autoencoder.py
diffusion.py
modelscope_t2v.py
unet_sd.py
were adapted from https://github.com/modelscope/modelscope/tree/57791a8cc59ccf9eda8b94a9a9512d9e3029c00b/modelscope/models/multi_modal/video_synthesis (license included in LICENSES/Apache-2.0.txt):
Copyright (c) 2024 Mitsubishi Electric Research Laboratories (MERL)
Copyright (c) 2021-2022 The Alibaba Fundamental Vision Team Authors
The following file:
modelscope_t2v_pipeline.py
was adapted from https://github.com/modelscope/modelscope/blob/bedec553c17b7e297da9db466fee61ccbd4295ba/modelscope/pipelines/multi_modal/text_to_video_synthesis_pipeline.py (license included in LICENSES/Apache-2.0.txt):
Copyright (c) 2024 Mitsubishi Electric Research Laboratories (MERL)
Copyright (c) Alibaba, Inc. and its affiliates.
The following file:
util.py
was adapted from https://github.com/modelscope/modelscope/blob/57791a8cc59ccf9eda8b94a9a9512d9e3029c00b/modelscope/models/cv/anydoor/ldm/util.py (license included in LICENSES/Apache-2.0.txt):
Copyright (c) 2024 Mitsubishi Electric Research Laboratories (MERL)
Copyright (c) 2021-2022 The Alibaba Fundamental Vision Team Authors. All rights reserved.
The following files:
dataset/datasets_mug.py
dataset/datasets_ucf.py
were adapted from LFDM (license included in LICENSES/BSD-2-Clause.txt):
Copyright (c) 2024 Mitsubishi Electric Research Laboratories (MERL)
Copyright (C) 2023 NEC Laboratories America, Inc. ("NECLA"). All rights reserved.
The following files:
demo_img2vid.py
gen_video_mug.py
gen_video_ucf.py
were adapted from LFDM (license included in LICENSES/BSD-2-Clause.txt):
Copyright (c) 2024 Mitsubishi Electric Research Laboratories (MERL)
Copyright (C) 2023 NEC Laboratories America, Inc. ("NECLA"). All rights reserved.
If you use our work, please use the following citation:
@inproceedings{ni2024ti2v,
title={TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models},
author={Ni, Haomiao and Egger, Bernhard and Lohit, Suhas and Cherian, Anoop and Wang, Ye and Koike-Akino, Toshiaki and Huang, Sharon X and Marks, Tim K},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024}
}