Shuai Tan¹, Bin Ji¹, Mengxiao Bi², Ye Pan¹
¹Shanghai Jiao Tong University  ²NetEase Fuxi AI Lab
ECCV 2024 Oral
Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly enhances the application and entertainment value of talking head generation. This necessitates a deep exploration of the decoupling space for facial features, ensuring that they a) operate independently without mutual interference and b) can be preserved to share with different modal inputs, both aspects often neglected in existing methods. To address this gap, this paper proposes a novel Efficient Disentanglement framework for Talking head generation (EDTalk). Our framework enables individual manipulation of mouth shape, head pose, and emotional expression, conditioned on both video and audio inputs. Specifically, we employ three lightweight modules to decompose the facial dynamics into three distinct latent spaces representing the mouth, pose, and expression, respectively. Each space is characterized by a set of learnable bases whose linear combinations define specific motions. To ensure independence and accelerate training, we enforce orthogonality among the bases and devise an efficient training strategy to allocate motion responsibilities to each space without relying on external knowledge. The learned bases are then stored in corresponding banks, enabling shared visual priors with audio input. Furthermore, considering the properties of each space, we propose an Audio-to-Motion module for audio-driven talking head synthesis. Experiments demonstrate the effectiveness of EDTalk.
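To make the idea of the learnable bases concrete, below is a minimal PyTorch sketch of one motion bank: a set of base vectors whose linear combinations define a motion latent, plus an orthogonality penalty of the kind described above. This is not the official implementation; the dimensions and names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionBank(nn.Module):
    """One latent space (e.g. mouth, pose, or expression) spanned by learnable bases."""

    def __init__(self, num_bases: int = 20, dim: int = 512):
        super().__init__()
        # Bank of learnable base vectors (sizes are illustrative).
        self.bases = nn.Parameter(torch.randn(num_bases, dim))

    def forward(self, weights: torch.Tensor) -> torch.Tensor:
        # weights: (batch, num_bases) -> motion latent: (batch, dim)
        bases = F.normalize(self.bases, dim=-1)
        return weights @ bases

    def orthogonality_loss(self) -> torch.Tensor:
        # Encourage mutually orthogonal bases: || B B^T - I ||^2
        bases = F.normalize(self.bases, dim=-1)
        gram = bases @ bases.t()
        identity = torch.eye(gram.size(0), device=gram.device)
        return ((gram - identity) ** 2).mean()
```

In EDTalk, three such banks (mouth, pose, expression) are learned and, roughly speaking, a driving video or audio signal is mapped to combination weights over each bank.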
We train and test with Python 3.8 and PyTorch. To install the dependencies, run:
git clone https://github.com/tanshuai0219/EDTalk.git
cd EDTalk
conda create -n EDTalk python=3.8
conda activate EDTalk
# Python packages
pip install -r requirements.txt
# Python packages for Windows
pip install -r requirements_windows.txt
Thanks to nitinmukesh for providing a Windows 11 installation tutorial; feel free to follow his channel!
Launch the Gradio interface (thanks to contributor newgenai79!):
python webui_emotions.py
Download the checkpoints (or from the huggingface link) and put them into ./ckpts.
Chinese users can download the checkpoints via this link.
python demo_EDTalk_A_using_predefined_exp_weights.py --source_path path/to/image --audio_driving_path path/to/audio --pose_driving_path path/to/pose --exp_type type/of/expression --save_path path/to/save
# example:
python demo_EDTalk_A_using_predefined_exp_weights.py --source_path res/results_by_facesr/demo_EDTalk_A.png --audio_driving_path test_data/mouth_source.wav --pose_driving_path test_data/pose_source1.mp4 --exp_type angry --save_path res/demo_EDTalk_A_using_weights.mp4
python demo_EDTalk_A.py --source_path path/to/image --audio_driving_path path/to/audio --pose_driving_path path/to/pose --exp_driving_path path/to/expression --save_path path/to/save
# example:
python demo_EDTalk_A.py --source_path res/results_by_facesr/demo_EDTalk_A.png --audio_driving_path test_data/mouth_source.wav --pose_driving_path test_data/pose_source1.mp4 --exp_driving_path test_data/expression_source.mp4 --save_path res/demo_EDTalk_A.mp4
The result will be stored in save_path.
The source image and the driving videos must first be cropped using the scripts crop_image2.py (download shape_predictor_68_face_landmarks.dat and put it in the ./data_preprocess dir) and crop_video.py. Make sure every video's frame rate is 25 fps.
You can also use crop_image.py to crop the image, but increase_ratio must be set carefully and may take several tries to get the best result.
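If you are unsure whether a driving video is 25 fps, a quick check is shown below. This is a small helper using OpenCV (assumed to be available in your environment), not one of the repo's scripts.

```python
import cv2

# Print the frame rate of a driving video; it should be 25 fps for EDTalk.
cap = cv2.VideoCapture("test_data/pose_source1.mp4")
print("fps:", cap.get(cv2.CAP_PROP_FPS))
cap.release()
```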
python demo_lip_pose.py --fix_pose --source_path path/to/image --audio_driving_path path/to/audio --save_path path/to/save
# example:
python demo_lip_pose.py --fix_pose --source_path test_data/identity_source.jpg --audio_driving_path test_data/mouth_source.wav --save_path res/demo_lip_pose_fix_pose.mp4
python demo_lip_pose.py --source_path path/to/image --audio_driving_path path/to/audio --pose_driving_path path/to/pose --save_path path/to/save
# example:
python demo_lip_pose.py --source_path test_data/identity_source.jpg --audio_driving_path test_data/mouth_source.wav --pose_driving_path test_data/pose_source1.mp4 --save_path res/demo_lip_pose.mp4
Source Img | EDTalk | EDTalk + LivePortrait |
---|---|---|
python demo_lip_pose_V.py --source_path path/to/image --audio_driving_path path/to/audio --lip_driving_path path/to/mouth --pose_driving_path path/to/pose --save_path path/to/save
# example:
python demo_lip_pose_V.py --source_path res/results_by_facesr/demo_lip_pose5.png --audio_driving_path test_data/mouth_source.wav --lip_driving_path test_data/mouth_source.mp4 --pose_driving_path test_data/pose_source1.mp4 --save_path demo_lip_pose_V.mp4
Source Img | demo_lip_pose_V Results | + FaceSR |
---|---|---|
python demo_change_a_video_lip.py --source_path path/to/video --audio_driving_path path/to/audio --save_path path/to/save
# example
python demo_change_a_video_lip.py --source_path test_data/pose_source1.mp4 --audio_driving_path test_data/mouth_source.wav --save_path res/demo_change_a_video_lip.mp4
Source Video | Results #1 | Results #2 |
---|---|---|
python demo_EDTalk_V.py --source_path path/to/image --lip_driving_path path/to/lip --audio_driving_path path/to/audio --pose_driving_path path/to/pose --exp_driving_path path/to/expression --save_path path/to/save
# example:
python demo_EDTalk_V.py --source_path test_data/identity_source.jpg --lip_driving_path test_data/mouth_source.mp4 --audio_driving_path test_data/mouth_source.wav --pose_driving_path test_data/pose_source1.mp4 --exp_driving_path test_data/expression_source.mp4 --save_path res/demo_EDTalk_V.mp4
The result will be stored in save_path.
Thanks to Tao Liu for the proposal!
This step upscales the resolution from 256 to 512 and addresses blurry rendering.
Please install the additional dependencies:
pip install facexlib
pip install tb-nightly -i https://mirrors.aliyun.com/pypi/simple
pip install gfpgan
Then enable the --face_sr option in your scripts. The first run will download the GFPGAN weights (you can optionally download the GFPGAN checkpoints in advance and put them in the gfpgan/weights dir).
Here are some examples:
python demo_lip_pose.py --source_path path/to/image --audio_driving_path path/to/audio --pose_driving_path path/to/pose --save_path path/to/save --face_sr
python demo_EDTalk_V.py --source_path path/to/image --lip_driving_path path/to/lip --audio_driving_path path/to/audio --pose_driving_path path/to/pose --exp_driving_path path/to/expression --save_path path/to/save --face_sr
python demo_EDTalk_A_using_predefined_exp_weights.py --source_path path/to/image --audio_driving_path path/to/audio --pose_driving_path path/to/pose --exp_type type/of/expression --save_path path/to/save --face_sr
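For reference, the --face_sr option relies on GFPGAN for face restoration. Below is a rough, standalone sketch of how GFPGAN is typically applied to a single frame; it is not the repo's exact code path, and the weight path and file names are assumptions.

```python
import cv2
from gfpgan import GFPGANer

# Restore/upscale one frame with GFPGAN (assumed weights path; adjust to your setup).
restorer = GFPGANer(
    model_path="gfpgan/weights/GFPGANv1.4.pth",  # assumption: weights downloaded here
    upscale=2,            # 256 -> 512, matching the goal described above
    arch="clean",
    channel_multiplier=2,
    bg_upsampler=None,
)

frame = cv2.imread("res/demo_EDTalk_A_frame.png")  # hypothetical input frame
_, _, restored = restorer.enhance(
    frame, has_aligned=False, only_center_face=False, paste_back=True
)
cv2.imwrite("res/demo_EDTalk_A_frame_sr.png", restored)
```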
Source Img | EDTalk Results | EDTalk + FaceSR |
---|---|---|
There are a few known issues at the moment; we are checking them carefully, so please be patient! Note: we use Obama and the paths on our machine (/data/ts/xxxxxx) as an example; replace them with your own paths:
Download the Obama data from AD-NeRF and put it at '/data/ts/datasets/person_specific_dataset/AD-NeRF/video/Obama.mp4'.
Crop the video and resample it to 25 fps:
python data_preprocess/crop_video.py --inp /data/ts/datasets/person_specific_dataset/AD-NeRF/video/Obama.mp4 --outp /data/ts/datasets/person_specific_dataset/AD-NeRF/video_crop/Obama.mp4
Save video as frames:
ffmpeg -i /data/ts/datasets/person_specific_dataset/AD-NeRF/video_crop/Obama.mp4 -r 25 -f image2 /data/ts/datasets/person_specific_dataset/AD-NeRF/video_crop_frame/Obama/%4d.png
Start training:
python train_fine_tune.py --datapath /data/ts/datasets/person_specific_dataset/AD-NeRF/video_crop_frame/Obama --only_fine_tune_dec
Change datapath to your own data. --only_fine_tune_dec means that only the dec module is trained. In our experience, training only dec helps with image quality, so we recommend it. You can also set it to False to fine-tune the full model. Check the saved samples frequently (at exp_path/exp_name/checkpoint; in our case, /data/ts/checkpoints/EDTalk/fine_tune/Obama/checkpoint) to find the best model in time.
Step #0 | Step #100 | Step #200 |
---|---|---|
The first row is the source image, the second row is the driving image, and the third row is the generated result.
We hope more people will get involved, and we will handle pull requests promptly. There are still some tasks that need assistance, such as creating a Colab notebook, a web UI, and translation work, among others.
[ICCV 23] EMMN: Emotional Motion Memory Network for Audio-driven Emotional Talking Face Generation
[AAAI 24] Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style
[AAAI 24] Say Anything with Any Style
@inproceedings{tan2024edtalk,
title = {EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis},
author = {Tan, Shuai and Ji, Bin and Bi, Mengxiao and Pan, Ye},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024}
}
Some code is borrowed from the following projects:
Some figures in the paper are inspired by:
Thanks to these great projects.