LivePortraitTalker

LivePortraitTalker is a zero-shot talking head generation approach that combines the pretrained models of SadTalker and LivePortrait. The novelties of this repo are:

Training the mapping network of SadTalker for the LivePortrait rendering networks.
Proposing synthetic head pose generation, which uses the initial head pose and the MappingNet outputs.

This is only a proof of concept of the approach: the model was trained on roughly 2% of the VoxCeleb2 dataset.

showcase
Outputs of LivePortraitTalker

Introduction

LivePortraitTalker Architecture
Model Diagram

The pretrained models in the green boxes are from SadTalker; those in the red boxes are from the LivePortrait repository. The MappingNet architecture in the purple box is taken from SadTalker and trained for this work. MappingNet was trained on the VoxCeleb2 dataset. Because of GPU costs, only about 2000 videos (<2% of the dataset) were used for training, so the results may be inconsistent and not always high quality. Nevertheless, this work demonstrates the concept.

Installation

Python 3.9+
Install PyTorch 2.3.0, choosing the build that is compatible with your system. You can find the PyTorch 2.3.0 versions here
pip install -r requirements.txt
Don't forget to change the device type in the config file. Set inference.device to specify where the model will run: cuda for GPU, cpu for CPU, and mps for Apple Silicon Macs.
The SadTalker and LivePortrait pretrained models must be downloaded from their repositories. MappingNet can be downloaded from here, or you can run the following command to download the pretrained models automatically:

sh scripts/download_models.sh
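
If you are unsure which value to use for inference.device, the following is a minimal sketch for checking what PyTorch can see on your machine. It is not part of this repository; the function names choose_device and detect_device are hypothetical, and the preference order (cuda, then mps, then cpu) is an assumption.

```python
def choose_device(has_cuda: bool, has_mps: bool) -> str:
    """Pick a value for inference.device from availability flags.

    Assumed preference order: cuda, then mps, then cpu.
    """
    if has_cuda:
        return "cuda"
    if has_mps:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch which backends are available; fall back to cpu."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet, so only cpu can be suggested.
        return "cpu"
    # torch.backends.mps is only present in newer PyTorch releases.
    has_mps = hasattr(torch.backends, "mps") and torch.backends.mps.is_available()
    return choose_device(torch.cuda.is_available(), has_mps)


if __name__ == "__main__":
    print(detect_device())
```

The printed value can then be copied into inference.device in the config file by hand.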

Inference

There are several options for generating a talking head: synthetic head pose generation, reference head pose, still, video2video, and pupil control.

Synthetic Head Pose Generation

Most talking head methods, such as SadTalker, generate head poses from the input audio. However, I do not think that head poses share common features with audio. Therefore, I propose Synthetic Head Pose Generation, which does not use audio. This approach can generate head poses more naturally than previous approaches. More information about Synthetic Head Pose Generation will be given in later sections.

python inference.py --config_path config.yaml --source_path <path/to/source/image> --audio_path <path/to/audio> --save_path <path/to/save/folder>

Reference Head Pose

This option takes a reference video as input and generates the talking head using the head poses of the person in the reference video. Once a reference video has been processed, its head poses are saved and reused in later runs to speed up inference. In some cases the input audio and the reference head poses may not match well, so this option should be used with reference videos whose head poses are relatively stable.

python inference.py --config_path config.yaml --source_path <path/to/source/image> --audio_path <path/to/audio> --save_path <path/to/save/folder> --ref_head_pose_path <path/to/reference/video>

This pipeline selects the initial head pose frame randomly. Add --ref_frames_from_zero to set the initial frame to 0:

python inference.py --config_path config.yaml --source_path <path/to/source/image> --audio_path <path/to/audio> --save_path <path/to/save/folder> --ref_head_pose_path <path/to/reference/video> --ref_frames_from_zero
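
To illustrate the difference between the random start and --ref_frames_from_zero, here is a minimal sketch of how a starting frame could be chosen. It is not the repository's code: the function name pick_start_frame is hypothetical, and it assumes that a contiguous window of reference poses is used and that the start falls back to 0 when the reference is not longer than the output.

```python
import random
from typing import Optional


def pick_start_frame(
    num_ref_frames: int,
    num_needed_frames: int,
    from_zero: bool = False,
    rng: Optional[random.Random] = None,
) -> int:
    """Return the index of the first reference frame to use.

    from_zero mirrors the --ref_frames_from_zero flag: always start at 0.
    Otherwise a start is drawn so that the window still fits in the reference.
    """
    if from_zero or num_ref_frames <= num_needed_frames:
        # Assumption: no room to shift the window, so start at the beginning.
        return 0
    rng = rng or random.Random()
    last_valid_start = num_ref_frames - num_needed_frames
    return rng.randint(0, last_valid_start)


if __name__ == "__main__":
    # A 300-frame reference and a 120-frame output leave starts 0..180.
    print(pick_start_frame(300, 120))
    print(pick_start_frame(300, 120, from_zero=True))
```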

Still

There are no head movements in this option; only lip movements and blinks are generated.

python inference.py --config_path config.yaml --source_path <path/to/source/image> --audio_path <path/to/audio> --save_path <path/to/save/folder> --still

Video2Video

If a video is given as source_path, the repository generates the lips from the audio while keeping the head poses of the original frames.

python inference.py --config_path config.yaml --source_path <path/to/source/video> --audio_path <path/to/audio> --save_path <path/to/save/folder>

Pupil Control

Unlike SadTalker, this repository predicts only lip expressions, so the other facial expressions are taken from the source image. This can be problematic if the eyes in the source image are not looking directly at the camera. Thanks to ComfyUI-AdvancedLivePortrait, the pupils can be adjusted.

python inference.py --config_path config.yaml --source_path <path/to/source/image> --audio_path <path/to/audio> --save_path <path/to/save/folder> --pupil_x <pupil/x/number> --pupil_y <pupil/y/number>
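
When running several of the options above from a script, it can help to assemble the command in one place. The following is a minimal sketch that only uses the flags shown in this README; the helper name build_inference_command and the example file names are hypothetical, and whether all flags can be combined in a single run is an assumption that should be checked against inference.py.

```python
import shlex
from typing import List, Optional


def build_inference_command(
    source_path: str,
    audio_path: str,
    save_path: str,
    config_path: str = "config.yaml",
    ref_head_pose_path: Optional[str] = None,
    ref_frames_from_zero: bool = False,
    still: bool = False,
    pupil_x: Optional[float] = None,
    pupil_y: Optional[float] = None,
) -> List[str]:
    """Build the argument list for inference.py from the documented flags."""
    cmd = [
        "python", "inference.py",
        "--config_path", config_path,
        "--source_path", source_path,
        "--audio_path", audio_path,
        "--save_path", save_path,
    ]
    if ref_head_pose_path is not None:
        cmd += ["--ref_head_pose_path", ref_head_pose_path]
        # The from-zero flag is only documented together with a reference video.
        if ref_frames_from_zero:
            cmd.append("--ref_frames_from_zero")
    if still:
        cmd.append("--still")
    if pupil_x is not None:
        cmd += ["--pupil_x", str(pupil_x)]
    if pupil_y is not None:
        cmd += ["--pupil_y", str(pupil_y)]
    return cmd


if __name__ == "__main__":
    # Example file names are placeholders.
    cmd = build_inference_command("face.png", "speech.wav", "results", still=True)
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root.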

Head Pose Generation

will be updated

mkara44 / liveportrait_talker
