Welcome to the official repository for Reducio Variational Autoencoder (Reducio-VAE)! Reducio-VAE is a model for encoding videos into an extremely small latent space. It is part of the Reducio-DiT, which is a highly efficient video generation method. Reducio-VAE encodes a 16-frame video clip to $T/4*H/32*W/32$ latent space based on a content image prior, which enables 4096x compression rate on the videos. More details can be found in the paper.
Reducio-VAE was developed to enable high compression ratio on videos, supporting efficient video generation. Existing 3D VAEs are generally extended from 2D VAE, which is designed for image generation and has large redundancy when handling video. Compared to 2D VAE, Reducio-VAE achieved 64x higher compression ratio.
A detailed discussion of Reducio-VAE, including how it was developed and tested, can be found in our paper.
Reducio-VAE is best suited for supporting training your own video diffusion model for research purpose.
Reducio-VAE is not well suited for processing long videos. It currently can only handle 1-second video clips.
We do not recommend using Reducio-VAE in commercial or real-world applications without further testing and development. It is being released for research purposes.
Reducio-VAE was not designed or evaluated for all possible downstream purposes. Developers should consider its inherent limitations (more below) as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness concerns specific to each intended downstream use.
We do not recommend using Reducio-VAE in the context of high-risk decision making (e.g. in law enforcement, legal, finance, or healthcare).
Set up the environment with
pip3 install -r requirements.txt
Training data was prepared by self-collecting. We recommend using Pexels or other high-quality videos.
We directly trained Reducio-VAE from scratch on $256*256$ resolution with 16 FPS. No further fine-tuning on $512*512$ or $1024*1024$. To use it on higher resolution, we split videos into tiles in spatial dimensions.
# training Reducio-VAE-ft4-fs32
python -m torch.distributed.launch --nproc_per_node=${GPU_PER_NODE_COUNT} \
--node_rank=${NODE_RANK} \
--nnodes=${NODE_COUNT} \
--master_addr=${MASTER_ADDR} \
--master_port=${MASTER_PORT} \
--use_env main.py \
-b configs/autoencoder/reducio_kl_ft4_fs32_z16_attn23.yaml \
-t -r ${output_dir} -k ${wandb_key}
# Inference with our pre-trained weights
python -m torch.distributed.launch --nproc_per_node=${GPU_PER_NODE_COUNT} \
--master_addr=${MASTER_ADDR} \
--master_port=${MASTER_PORT} \
main.py -r ${output_dir} -t False \
-b configs/autoencoder/reducio_kl_ft4_fs32_z16_attn23.yaml \
--pretrained checkpoint/reducio-ft4-fs32-attn23.ckpt \
-k ${wandb_key}
Users can adjust the batch_frequency
params of Image_logger
in the config. Below is an example of logged input videos (top) sampled from Pexels and reconstructed results (bottom) by Reducio-VAE :
name | $f_t$ | $f_s$ | checkpoint |
---|---|---|---|
Reducio-VAE | 2 | 32 | huggingface |
Reducio-VAE | 4 | 32 | huggingface |
Reducio-VAE performed best on high-quality videos, but worse on low-quality videos like UCF-101. The reason is that the visual quality of UCF-101 is low, where the prior content image is blurring in many cases.
Metrics on 1K Pexels validation set and UCF-101:
Method | Downsample Factor | $\ | z\ | $ | PSNR | SSIM | LPIPS | rFVD (Pexels) | rFVD (UCF-101) |
---|---|---|---|---|---|---|---|---|---|
SD2.1-VAE | $1\times8\times8$ | 4 | 29.23 | 0.82 | 0.09 | 25.96 | 21.00 | ||
SDXL-VAE | $1\times8\times8$ | 16 | 30.54 | 0.85 | 0.08 | 19.87 | 23.68 | ||
OmniTokenizer | $4\times8\times8$ | 8 | 27.11 | 0.89 | 0.07 | 23.88 | 30.52 | ||
OpenSora-1.2 | $4\times8\times8$ | 16 | 30.72 | 0.85 | 0.11 | 60.88 | 67.52 | ||
Cosmos Tokenizer | $8\times8\times8$ | 16 | 30.84 | 0.74 | 0.12 | 29.44 | 22.06 | ||
Cosmos Tokenizer | $8\times16\times16$ | 16 | 28.14 | 0.65 | 0.18 | 77.87 | 119.37 | ||
Reducio-VAE | $4\times32\times32$ | 16 | 35.88 | 0.94 | 0.05 | 17.88 | 65.17 |
Reducio-VAE was developed for research and experimental purposes. Further testing and validation are needed before considering its application in commercial or real-world scenarios.
Currently, Reducio-VAE can only encode 16-frame video clips. Longer length is not supported.
This work is purely a research project. Currently, we have no plans to incorporate it into a product or expand access to the public. We will also put Microsoft AI principles into practice when further developing the models. In our research paper, we account for the ethical concerns associated with text-image-to-video research. To mitigate issues associated with training data, we have implemented a rigorous filtering process to purge our training data of inappropriate content, such as explicit imagery and offensive language, to minimize the likelihood of generating inappropriate content.
The code and model are licensed under the MIT license.
Our code is built on top of Stable Diffusion. We would like to thank the SD team for their foundational work.
If you use our work, please cite:
@article{tian2024reducio,
title={REDUCIO! Generating 1024*1024 Video within 16 Seconds using Extremely Compressed Motion Latents},
author={Tian, Rui and Dai, Qi and Bao, Jianmin and Qiu, Kai and Yang, Yifan and Luo, Chong and Wu, Zuxuan and Jiang, Yu-Gang},
journal={arXiv preprint arXiv:2411.13552},
year={2024}
}