Scenario: "Mouth Replacement" "Lip Editing", suitable for video translation, live streaming, etc.
conda create -n diffdub python==3.9.0
conda activate diffdub
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
pip install -r requirements.txt
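Optionally, run a quick sanity check (a minimal sketch, assuming the environment above installed cleanly) to confirm that PyTorch sees CUDA before running the demos:

import torch

print(torch.__version__)          # expect 1.8.0
print(torch.version.cuda)         # expect 11.1
print(torch.cuda.is_available())  # should print True on a CUDA-capable machine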
Please download the checkpoints from URL and place them into the following folder. (Alternatively, you can run bash auto_download_ckpt.sh to download them automatically using git lfs.)
assets/checkpoints/
├── stage1.ckpt
└── stage2.ckpt
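After downloading, you can optionally verify that both checkpoints are present and loadable (a minimal sketch; the filenames follow the tree above, so adjust them if your copies use the stage1_state_dict.ckpt / stage2_state_dict.ckpt names shown in the demo commands below):

import os
import torch

ckpt_dir = 'assets/checkpoints'
for name in ('stage1.ckpt', 'stage2.ckpt'):
    path = os.path.join(ckpt_dir, name)
    assert os.path.isfile(path), f'missing checkpoint: {path}'
    # Assumes the .ckpt files are plain PyTorch-serializable dicts; load on CPU just to validate.
    state = torch.load(path, map_location='cpu')
    print(f'{name}: OK ({len(state)} top-level entries)')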
stage1.ckpt: Lip rendering module. It takes a mask and lip movements as inputs and renders a face with the specified lip movements. In this repository this model is also referred to as the renderer, auto-encoder, or diffusion model; these terms are synonymous.
stage2.ckpt: Sequence generation module. It takes audio and reference lip movements as inputs and generates a lip sequence (also called the motion/lip latent), which is then passed to the first model for rendering.
Input: one video / image + an audio clip
Output: a new lip-synced video driven by the audio
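The demos below pass the stage-2 audio input as a precomputed HuBERT feature file (--hubert_feat_path, a .npy). Purely for intuition, here is a rough, hypothetical sketch of extracting such features with torchaudio and transformers; the exact HuBERT variant, layer, and frame rate this repo expects are assumptions, so check the repo's own preprocessing before using this on your own audio:

import numpy as np
import torch
import torchaudio
from transformers import HubertModel

# Load the waveform and resample to 16 kHz mono, which HuBERT expects.
wav, sr = torchaudio.load('assets/samples/WRA_LamarAlexander_000/WRA_LamarAlexander_000.wav')
wav = torchaudio.transforms.Resample(sr, 16000)(wav).mean(dim=0, keepdim=True)

# NOTE: this particular pretrained HuBERT checkpoint is an assumption, not
# necessarily the one used to produce the .npy files shipped with this repo.
model = HubertModel.from_pretrained('facebook/hubert-large-ls960-ft').eval()
with torch.no_grad():
    feats = model(wav).last_hidden_state.squeeze(0)  # (num_audio_frames, feature_dim)

np.save('my_audio.npy', feats.cpu().numpy())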
1. One-shot
Only one frame from the video is used as the reference frame (the code randomly selects a frame from the video); it serves as the reference lip latent input for the second stage.
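Conceptually (a hypothetical sketch, not the repository's actual code), selecting that random reference frame amounts to something like the following, assuming OpenCV:

import random
import cv2

# Open the source video and jump to a random frame index.
cap = cv2.VideoCapture('assets/samples/RD_Radio35_000/RD_Radio35_000.mp4')
num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.set(cv2.CAP_PROP_POS_FRAMES, random.randrange(num_frames))
ok, reference_frame = cap.read()  # BGR image used as the one-shot reference
cap.release()

To run the full one-shot demo: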
python demo.py \
--one_shot \
--video_inference \
--stage1_checkpoint_path 'assets/checkpoints/stage1_state_dict.ckpt' \
--stage2_checkpoint_path 'assets/checkpoints/stage2_state_dict.ckpt' \
--saved_path 'assets/samples/RD_Radio30_000/' \
--hubert_feat_path 'assets/samples/WRA_LamarAlexander_000/WRA_LamarAlexander_000.npy' \
--wav_path 'assets/samples/WRA_LamarAlexander_000/WRA_LamarAlexander_000.wav' \
--mp4_original_path 'assets/samples/RD_Radio35_000/RD_Radio35_000.mp4' \
--denoising_step 20 \
--saved_name 'one_shot_pred.mp4' \
--device 'cuda:0'
Results:
You can view it at assets/samples_results/one_shot_pred.mp4.
2. Few-shot
A segment of the video is used as reference frames (the code takes the first 3 seconds of the video, i.e., 75 frames at 25 fps) to serve as the reference lip latent input for the second stage. (More reference frames can provide finer lip details, but this may be limited by the training data, so in this repo the result is similar to one-shot.)
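Again purely as a hypothetical sketch (not the repository's code), gathering those first 75 reference frames could look like:

import cv2

cap = cv2.VideoCapture('assets/samples/RD_Radio35_000/RD_Radio35_000.mp4')
reference_frames = []
while len(reference_frames) < 75:  # first 3 s at 25 fps
    ok, frame = cap.read()
    if not ok:
        break
    reference_frames.append(frame)
cap.release()

To run the full few-shot demo: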
python demo.py \
--video_inference \
--stage1_checkpoint_path 'assets/checkpoints/stage1_state_dict.ckpt' \
--stage2_checkpoint_path 'assets/checkpoints/stage2_state_dict.ckpt' \
--saved_path 'assets/samples/RD_Radio30_000/' \
--hubert_feat_path 'assets/samples/WRA_LamarAlexander_000/WRA_LamarAlexander_000.npy' \
--wav_path 'assets/samples/WRA_LamarAlexander_000/WRA_LamarAlexander_000.wav' \
--mp4_original_path 'assets/samples/RD_Radio35_000/RD_Radio35_000.mp4' \
--denoising_step 20 \
--saved_name 'few_shot_pred.mp4' \
--device 'cuda:0'
Results:
You can view it at assets/samples_results/few_shot_pred.mp4.
3. One-shot (Single Portrait)
This script tests driving a single image with audio, specifically for lip movement. Unlike the scripts above, you need to drop the --video_inference flag and set --reference_image_path to the image you want to drive; --mp4_original_path is ignored in this case.
python demo.py \
--one_shot \
--stage1_checkpoint_path 'assets/checkpoints/stage1_state_dict.ckpt' \
--stage2_checkpoint_path 'assets/checkpoints/stage2_state_dict.ckpt' \
--saved_path 'assets/samples/RD_Radio30_000/' \
--hubert_feat_path 'assets/samples/RD_Radio32_000/RD_Radio32_000.npy' \
--wav_path 'assets/samples/RD_Radio32_000/RD_Radio32_000.wav' \
--denoising_step 20 \
--saved_name 'one_shot_portrait_pred.mp4' \
--device 'cuda:0' \
--reference_image_path 'assets/single_images/test001.png'
Results:
You can view it at assets/samples_results/one_shot_portrait_pred.mp4.
Notes:
We provide some test audios under test_demos/audios at URL for your testing. You can also run python demo_with_batch.py for random batch testing.
This model is not a final product. Because it was trained only on HDTF, it has potential biases; please pay special attention when using it. Potentially risky sections are marked as TODOs in the code, so please check them carefully.
If you find this work useful, please cite:
@inproceedings{liu2024diffdub,
title={DiffDub: Person-Generic Visual Dubbing Using Inpainting Renderer with Diffusion Auto-Encoder},
author={Liu, Tao and Du, Chenpeng and Fan, Shuai and Chen, Feilong and Yu, Kai},
booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={3630--3634},
year={2024},
organization={IEEE}
}
This project builds on diffae and espnet, and we are very grateful for these excellent codebases. We also thank cpdu and azuredsky for their kind help.
1. This library's code is not a formal product, and we have not tested all use cases; therefore, it cannot be directly offered to end-service customers.
2. The main purpose of making our code public is to facilitate academic demonstrations and communication. Any use of this code to spread harmful information is strictly prohibited.
3. Please use this library in compliance with the terms specified in the license file and avoid improper use.
4. When using the code, please follow and abide by local laws and regulations.
5. During the use of this code, you will bear the corresponding responsibility. We are not responsible for the generated results.
6. The materials on this page are for academic use only. Please do not use them for other purposes.