LivePortraitTalker is a zero-shot talking head generation approach that combines the pretrained models of SadTalker and LivePortrait.
Novelty of this repo:
This is only a proof of concept of the approach; the model was trained on less than 2% of the VoxCeleb2 dataset.
Outputs of LivePortraitTalker
Model Diagram
The pretrained models in the green boxes are from SadTalker; those in the red boxes are from the LivePortrait repository. The MappingNet architecture in the purple box is taken from SadTalker and is the only component trained here, using the VoxCeleb2 dataset. Due to GPU costs, it was trained on approximately 2000 videos (<2% of the dataset), so the results may not be consistent or of high quality. However, this work proves the concept.
pip install -r requirements.txt
Set inference.device to specify where the model will run: use cuda for GPU, cpu for CPU, and mps for Apple Silicon Macs.
sh scripts/download_models.sh
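The three inference.device values above can be checked before a run. The following is a minimal sketch, not code from this repository: the helper name resolve_device and the fallback to cpu are assumptions made for illustration. It only uses PyTorch's public availability checks and still works when PyTorch is not installed.

```python
VALID_DEVICES = ("cuda", "cpu", "mps")


def resolve_device(requested: str) -> str:
    """Return a usable device name, falling back to "cpu" when needed.

    Hypothetical helper: the repository may handle this differently.
    """
    if requested not in VALID_DEVICES:
        raise ValueError(
            f"inference.device must be one of {VALID_DEVICES}, got {requested!r}"
        )
    if requested == "cpu":
        return "cpu"
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    if requested == "cuda" and torch.cuda.is_available():
        return "cuda"
    if requested == "mps":
        # torch.backends.mps only exists in newer PyTorch releases
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
    # The requested accelerator is not available on this machine
    return "cpu"


if __name__ == "__main__":
    print(resolve_device("cuda"))
```

Falling back silently is a design choice of this sketch; raising an error instead may be preferable when a slow CPU run would be worse than a failed one.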
There are several options for generating a talking head: synthetic head pose generation, reference head pose, still, video2video, and pupil control.
Most talking head papers, such as SadTalker, generate head poses from the input audio. However, I do not think that head poses share common features with audio. Therefore, I propose Synthetic Head Pose Generation, which does not use audio. This approach can generate more natural head poses than previous approaches. More information about Synthetic Head Pose Generation is given in later sections.
python inference.py --config_path config.yaml --source_path <path/to/source/image> --audio_path <path/to/audio> --save_path <path/to/save/folder>
This option takes a reference video as input and generates the talking head using the head poses of the person in the reference video. Once a reference video has been processed, its head poses are saved and reused in later generations to speed up inference. In some cases the input audio and the reference head poses may not match, so this option works best with reference videos that have stable head poses.
python inference.py --config_path config.yaml --source_path <path/to/source/image> --audio_path <path/to/audio> --save_path <path/to/save/folder> --ref_head_pose_path <path/to/reference/video>
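The save-and-reuse behaviour described above can be pictured as a small on-disk cache keyed by the reference video path. This is a sketch under stated assumptions, not the repository's implementation: the function name cached_head_poses, the JSON cache format, the per-frame pose layout, and the extract callback are all hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, List

Pose = List[float]  # assumed layout, e.g. [yaw, pitch, roll] for one frame


def cached_head_poses(
    video_path: str,
    cache_dir: str,
    extract: Callable[[str], List[Pose]],
) -> List[Pose]:
    """Return head poses for a reference video, extracting them only once."""
    video = Path(video_path)
    # Key the cache on the resolved path so two videos with the same
    # file name in different folders do not collide.
    key = hashlib.sha1(str(video.resolve()).encode("utf-8")).hexdigest()[:16]
    cache_file = Path(cache_dir) / f"{video.stem}_{key}.json"

    if cache_file.exists():
        # Fast path: poses were saved by an earlier run
        return json.loads(cache_file.read_text())

    # Slow path: run the (expensive) extractor, then save the result
    poses = extract(str(video))
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(poses))
    return poses
```

A cache keyed on the path alone does not notice when the video file is replaced; including the file size or modification time in the key would cover that case.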
This pipeline selects the initial head pose frame randomly; --ref_frames_from_zero can be added to set the initial frame to 0:
python inference.py --config_path config.yaml --source_path <path/to/source/image> --audio_path <path/to/audio> --save_path <path/to/save/folder> --ref_head_pose_path <path/to/reference/video> --ref_frames_from_zero
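The effect of --ref_frames_from_zero can be illustrated with a short sketch. The function name select_reference_poses is hypothetical, and wrapping around to the start when the reference is shorter than the requested length is an assumption of this example; the repository may handle short references differently.

```python
import random
from typing import List, Optional, Sequence


def select_reference_poses(
    poses: Sequence,
    num_frames: int,
    from_zero: bool = False,
    rng: Optional[random.Random] = None,
) -> List:
    """Pick num_frames consecutive reference poses.

    The start frame is 0 when from_zero is set, otherwise random.
    """
    if not poses:
        raise ValueError("reference pose sequence is empty")
    rng = rng or random.Random()
    start = 0 if from_zero else rng.randrange(len(poses))
    # Assumption: wrap around to the beginning when the reference runs out
    return [poses[(start + i) % len(poses)] for i in range(num_frames)]


if __name__ == "__main__":
    reference = list(range(5))  # stand-in for five frames of head poses
    print(select_reference_poses(reference, 7, from_zero=True))
```

Starting from frame 0 makes runs repeatable, which helps when comparing outputs for the same reference video.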
There are no head movements in this option; only lip movements and blinks are generated.
python inference.py --config_path config.yaml --source_path <path/to/source/image> --audio_path <path/to/audio> --save_path <path/to/save/folder> --still
If a video is given as source_path, the repository generates the lips from the audio while keeping the head poses of the original frames.
python inference.py --config_path config.yaml --source_path <path/to/source/video> --audio_path <path/to/audio> --save_path <path/to/save/folder>
Unlike SadTalker, this repository predicts only lip expressions, so the other facial expressions are taken from the source image. This can be problematic if the eyes in the source image are not looking directly at the camera. Thanks to ComfyUI-AdvancedLivePortrait, the pupils can be adjusted.
python inference.py --config_path config.yaml --source_path <path/to/source/image> --audio_path <path/to/audio> --save_path <path/to/save/folder> --pupil_x <pupil/x/number> --pupil_y <pupil/y/number>
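When running many generations, the commands above can be assembled programmatically. The sketch below is a hypothetical wrapper, not part of the repository: build_inference_command is an invented name, it uses only the flags shown in this README, and the check that --ref_frames_from_zero requires a reference video is this sketch's own assumption.

```python
import shlex
from typing import List, Optional


def build_inference_command(
    source_path: str,
    audio_path: str,
    save_path: str,
    config_path: str = "config.yaml",
    ref_head_pose_path: Optional[str] = None,
    ref_frames_from_zero: bool = False,
    still: bool = False,
    pupil_x: Optional[float] = None,
    pupil_y: Optional[float] = None,
) -> List[str]:
    """Build the argument list for inference.py from the options above."""
    if ref_frames_from_zero and ref_head_pose_path is None:
        # Assumption of this sketch: the flag only applies to a reference video
        raise ValueError("ref_frames_from_zero needs ref_head_pose_path")

    cmd = [
        "python", "inference.py",
        "--config_path", config_path,
        "--source_path", source_path,  # an image, or a video for video2video
        "--audio_path", audio_path,
        "--save_path", save_path,
    ]
    if ref_head_pose_path is not None:
        cmd += ["--ref_head_pose_path", ref_head_pose_path]
    if ref_frames_from_zero:
        cmd.append("--ref_frames_from_zero")
    if still:
        cmd.append("--still")
    if pupil_x is not None:
        cmd += ["--pupil_x", str(pupil_x)]
    if pupil_y is not None:
        cmd += ["--pupil_y", str(pupil_y)]
    return cmd


if __name__ == "__main__":
    command = build_inference_command("face.png", "speech.wav", "out", still=True)
    print(shlex.join(command))
```

The returned list can be passed to subprocess.run(command, check=True) from the repository root; passing a list rather than a single string avoids quoting problems with paths that contain spaces.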
will be updated