By Lele Chen, Ross K. Maddox, Zhiyao Duan, Chenliang Xu.
University of Rochester.
This repository contains the original models (AT-net, VG-net) described in the paper Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss. The demo video is available at https://youtu.be/eH7h_bDRX2Q. This code can be applied directly to LRW and GRID. The outputs of the model are visualized here: the first is the synthesized landmark from ATnet; the rest are the attention map, motion map, and final results from VGnet.
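The attention map, motion map, and final result mentioned above come from VGnet's attention mechanism, which blends a generated motion (color) map into the base frame. A minimal numpy sketch, assuming a GANimation-style convention where attention values near 1 keep the base frame (the exact sign convention in the released code may differ):

```python
import numpy as np

def compose_frame(base, motion, attention):
    # Attention-weighted blend of the base frame and the motion (color) map.
    # Convention assumed here: attention ~ 1 keeps the base pixel,
    # attention ~ 0 takes the motion-map pixel. Check the released VGnet
    # code for the actual convention.
    return attention * base + (1.0 - attention) * motion

base = np.zeros((4, 4, 3))          # placeholder base frame
motion = np.ones((4, 4, 3))         # placeholder motion map
att = np.full((4, 4, 1), 0.25)      # per-pixel attention, broadcast over channels
out = compose_frame(base, motion, att)
```

With this convention, a uniform attention of 0.25 yields a frame that is 75% motion map and 25% base frame.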
If you use any codes, models or the ideas from this repo in your research, please cite:
@inproceedings{chen2019hierarchical,
title={Hierarchical cross-modal talking face generation with dynamic pixel-wise loss},
author={Chen, Lele and Maddox, Ross K and Duan, Zhiyao and Xu, Chenliang},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={7832--7841},
year={2019}
}
This code is tested under Python 2.7. The model we provide is trained on LRW; however, it also works well on GRID, VoxCeleb, and other datasets, so you can compare it directly against your own model on other datasets. We treat this as a fair comparison.
PyTorch environment: PyTorch 0.4.1 (conda install pytorch=0.4.1 torchvision cuda90 -c pytorch).
Install the Python dependencies (pip install -r requirement.txt).
Download the pretrained ATnet and VGnet weights from Google Drive and put them under the model folder.
Run the demo code: python demo.py
-device_ids: gpu id
-cuda: use cuda or not
-vg_model: pretrained VGnet weight
-at_model: pretrained ATnet weight
-lstm: use lstm or not
-p: input example image
-i: input audio file
-sample_dir: folder to save the outputs

Download and unzip the training data from LRW.
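For reference, the demo flags listed above could be parsed with an argparse setup like the following. This is a sketch, not the repository's actual code; the default values are placeholders:

```python
import argparse

def str2bool(v):
    # argparse's type=bool treats any non-empty string as True
    # (bool("False") is True), so parse boolean flags explicitly.
    return str(v).lower() in ("true", "1", "yes")

def build_parser():
    # Mirrors the demo flags documented above; defaults are placeholders,
    # not the repository's actual defaults.
    p = argparse.ArgumentParser(description="ATVGnet demo flags (sketch)")
    p.add_argument("-device_ids", type=str, default="0", help="gpu id")
    p.add_argument("-cuda", type=str2bool, default=True, help="use cuda or not")
    p.add_argument("-vg_model", type=str, help="pretrained VGnet weight")
    p.add_argument("-at_model", type=str, help="pretrained ATnet weight")
    p.add_argument("-lstm", type=str2bool, default=True, help="use lstm or not")
    p.add_argument("-p", type=str, help="input example image")
    p.add_argument("-i", type=str, help="input audio file")
    p.add_argument("-sample_dir", type=str, default="sample/",
                   help="folder to save the outputs")
    return p

args = build_parser().parse_args(
    ["-p", "image.png", "-i", "audio.wav", "-cuda", "False"]
)
```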
Preprocess the data (extract the facial landmarks and crop the images with dlib).
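The crop step of this preprocessing can be sketched as follows. The landmark array here is synthetic for illustration; in practice it would come from dlib's 68-point predictor (dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")), and the margin fraction is an assumed parameter, not the repository's actual value:

```python
import numpy as np

def crop_around_landmarks(image, landmarks, margin=0.2):
    """Square crop of `image` around facial landmarks.

    landmarks: (N, 2) integer array of (x, y) points, e.g. the 68 points
    produced by dlib's shape predictor. `margin` is an assumed padding
    fraction around the landmark bounding box.
    """
    x_min, y_min = landmarks.min(axis=0)
    x_max, y_max = landmarks.max(axis=0)
    size = max(x_max - x_min, y_max - y_min)
    half = size // 2 + int(margin * size)          # half-width with padding
    cx, cy = (x_min + x_max) // 2, (y_min + y_max) // 2
    h, w = image.shape[:2]
    top, bottom = max(cy - half, 0), min(cy + half, h)
    left, right = max(cx - half, 0), min(cx + half, w)
    return image[top:bottom, left:right]

image = np.zeros((256, 256, 3), dtype=np.uint8)    # placeholder frame
landmarks = np.array([[100, 100], [150, 160], [125, 130]])
crop = crop_around_landmarks(image, landmarks)
```

Note that landmark coordinates are (x, y) while the image is indexed [row, column], so the crop slices rows with y and columns with x.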
Train the ATnet model: python atnet.py
-device_ids: gpu id
-batch_size: batch size
-model_dir: folder to save weights
-lstm: use lstm or not
-sample_dir: folder to save visualized images during training

Test the model: python atnet_test.py
-device_ids: gpu id
-batch_size: batch size
-model_name: pretrained weights
-sample_dir: folder to save the outputs
-lstm: use lstm or not

Train the VGnet: python vgnet.py
-device_ids: gpu id
-batch_size: batch size
-model_dir: folder to save weights
-sample_dir: folder to save visualized images during training

Test the VGnet: python vgnet_test.py
-device_ids: gpu id
-batch_size: batch size
-model_name: pretrained weights
-sample_dir: folder to save the outputs

Overall ATVGnet
Regression-based discriminator network
Result visualization on different datasets:
Results compared with other SOTA methods:
Studies on image robustness with respect to landmark accuracy:
Quantitative results:
There is other interesting and useful research on audio-to-landmark generation; please check it out at https://github.com/eeskimez/Talking-Face-Landmarks-from-Speech.
MIT