FaceFormer Emo: Speech-Driven 3D Facial Animation with Emotion Embedding

Abstract

We propose FaceFormer Emo, a Transformer-based speech-to-3D-face-mesh model that produces highly emotional expressions from the input audio.

This work extends FaceFormer to produce realistic and expressive outputs without a large loss in lip vertex accuracy, and introduces a novel lip vertex loss function that upweights the vertices near the lips to improve lip accuracy and emotion dynamics.
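The exact weighting scheme lives in the training code; as a rough illustration of the idea only, a per-vertex weighted MSE could look like the sketch below, where `lip_weight` and the lip-vertex index list are placeholders, not the values used in this repo.

```python
import torch

def weighted_vertex_loss(pred, target, lip_idx, lip_weight=5.0):
    """Per-vertex MSE that upweights the vertices near the lips.

    pred, target: (T, V, 3) predicted / ground-truth vertex positions.
    lip_idx:      LongTensor of vertex indices in the lip region (placeholder).
    lip_weight:   factor applied to lip vertices (placeholder value).
    """
    weights = torch.ones(pred.shape[1], device=pred.device)  # (V,)
    weights[lip_idx] = lip_weight                             # boost lip region
    sq_err = ((pred - target) ** 2).sum(dim=-1)               # (T, V)
    return (weights * sq_err).mean()
```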

Poster

3-min Video Introduction

FaceFormer Emo Introduction Video: [YouTube Link]

FaceFormer

PyTorch implementation for the paper:

FaceFormer: Speech-Driven 3D Facial Animation with Transformers, CVPR 2022.

Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, Taku Komura

[Paper] [Project Page]

Given the raw audio input and a neutral 3D face mesh, our proposed end-to-end Transformer-based architecture, FaceFormer, can autoregressively synthesize a sequence of realistic 3D facial motions with accurate lip movements.
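At inference time the model is autoregressive: each frame of facial motion is predicted from the audio features and the frames generated so far, relative to the neutral template. The sketch below illustrates that loop only; `decode_step`, the additive-offset formulation, and the tensor shapes are illustrative assumptions, not the repo's actual API.

```python
import torch

@torch.no_grad()
def synthesize(model, audio_feats, template, num_frames):
    """Autoregressive decoding sketch: predict each frame from the audio
    features and all previously generated frames.

    model.decode_step is a hypothetical stand-in for the Transformer decoder.
    audio_feats: (1, T_audio, D) features from the wav2vec 2.0 encoder.
    template:    (1, V*3) flattened neutral face mesh.
    """
    frames = [template]                                   # seed with the neutral face
    for _ in range(num_frames):
        history = torch.stack(frames, dim=1)              # (1, t+1, V*3)
        offset = model.decode_step(audio_feats, history)  # predicted motion offset
        frames.append(template + offset)                  # offset the neutral template
    return torch.stack(frames[1:], dim=1)                 # (1, num_frames, V*3)
```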

Environment

Dependencies

Data

VOCASET

Request the VOCASET data from https://voca.is.tue.mpg.de/. Place the downloaded files data_verts.npy, raw_audio_fixed.pkl, templates.pkl, and subj_seq_to_idx.pkl in the folder VOCASET. Download "FLAME_sample.ply" from VOCA and put it in VOCASET/templates.
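Once the files are in place, a quick sanity check with numpy/pickle confirms the layout. The shapes noted in the comments and the `encoding="latin1"` argument are assumptions about how the VOCA files were pickled; adjust them if loading fails.

```python
import pickle
import numpy as np

# Sanity-check the downloaded VOCASET files (paths follow the layout above).
data_verts = np.load("VOCASET/data_verts.npy", mmap_mode="r")  # expected (N, 5023, 3) FLAME vertices
with open("VOCASET/templates.pkl", "rb") as f:
    templates = pickle.load(f, encoding="latin1")              # subject name -> neutral mesh
with open("VOCASET/subj_seq_to_idx.pkl", "rb") as f:
    seq_index = pickle.load(f, encoding="latin1")              # subject/sequence -> frame indices

print(data_verts.shape, len(templates), len(seq_index))
```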

BIWI

Request the BIWI dataset from the Biwi 3D Audiovisual Corpus of Affective Communication. Of the dataset's contents, only the 'faces' and 'rigid_scans' subfolders and the wav audio files are needed here.

Place the folders 'faces' and 'rigid_scans' in BIWI and place the wav files in BIWI/wav.

Demo

Download the pretrained models biwi.pth and vocaset.pth and put them under the BIWI and VOCASET folders, respectively. Given the audio signal, run the demo script to animate the corresponding template mesh.
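As a quick sanity check before running the demo, the checkpoints can be loaded on CPU with torch.load; whether they store a plain state dict or a pickled module is an assumption here, so the inspection below is only illustrative.

```python
import torch

# Load a released checkpoint on CPU and peek at its contents.
# (File location as described above; the exact checkpoint format is an
# assumption -- it may be a state dict or a full pickled module.)
ckpt = torch.load("VOCASET/vocaset.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print("state-dict keys:", list(ckpt.keys())[:5])
else:
    print("checkpoint object of type:", type(ckpt))
```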

Training and Testing on VOCASET

Data Preparation

Training and Testing

(WARNING: The training command may exhaust GPU memory; restricting train_subjects to 5 subjects is a safer setting.)

Visualization

Acknowledgement

We gratefully acknowledge ETHZ-CVL for providing the B3D(AC)2 database and MPI-IS for releasing the VOCASET dataset. The implementation of wav2vec2 is built upon huggingface-transformers, and the temporal bias is modified from ALiBi. We use MPI-IS/mesh for mesh processing and VOCA/rendering for rendering. We thank the authors for their excellent work. Any third-party packages are owned by their respective authors and must be used under their respective licenses.