maelfabien / Multimodal-Emotion-Recognition

A real-time multimodal emotion recognition web app for text, sound and video inputs
Apache License 2.0

Real-Time Multimodal Emotion Recognition


Don't hesitate to ⭐ the repo if you enjoy our work!

In a nutshell

We developed a multimodal emotion recognition platform to analyze the emotions of job candidates, in partnership with the French Employment Agency.

We analyze facial, vocal and textual emotions, using mostly deep-learning-based approaches. We deployed a web app using Flask.


The tool can be accessed from the WebApp repository by installing the requirements and launching main.py.

We have also written a paper on our work: https://www.overleaf.com/read/xvtrrfpvzwhf

Table of Contents

In this project, we are exploring state-of-the-art models in multimodal sentiment analysis. We have chosen to explore text, sound and video inputs and to develop an ensemble model that gathers the information from all these sources and displays it in a clear and interpretable way.

0. Technologies


I. Context

Affective computing is a field of Machine Learning and Computer Science that studies the recognition and processing of human affects. Multimodal Emotion Recognition is a relatively new discipline that aims to include text inputs, as well as sound and video. The field has been rising with the development of social networks, which have given researchers access to a vast amount of data.

II. Data Sources

We have chosen to diversify the data sources we used depending on the type of data considered. All data sets used are free of charge and can be directly downloaded.

III. Download

| Modality | Data | Processed Data (for training) | Pre-trained Model | Colab Notebook | Other |
|----------|------|-------------------------------|-------------------|----------------|-------|
| Text  | here | X-train, y-train, X-test, y-test | Weights, Model | ---            | ---               |
| Audio | here | X-train, y-train, X-test, y-test | Weights, Model | Colab Notebook | ---               |
| Video | here | X-train, y-train, X-test, y-test | Weights, Model | Colab Notebook | Face Detect Model |

IV. Methodology

Our aim is to develop a model able to provide live sentiment analysis with a visual user interface. Therefore, we have decided to separate two types of inputs:

a. Text Analysis


Pipeline

The text-based personality recognition pipeline has the following structure:

Model

We have chosen a neural network architecture based on both one-dimensional convolutional neural networks and recurrent neural networks. The one-dimensional convolution layer plays a role comparable to feature extraction: it allows finding patterns in the text data. The Long Short-Term Memory cell is then used to leverage the sequential nature of natural language: unlike regular neural networks, where inputs are assumed to be independent of each other, these architectures progressively accumulate and capture information through the sequence. LSTMs have the property of selectively remembering patterns for long durations of time.

Our final model first includes 3 consecutive blocks, each consisting of the following four layers: one-dimensional convolution layer, max pooling, spatial dropout and batch normalization. The numbers of convolution filters are respectively 128, 256 and 512 for each block, the kernel size is 8, the max pooling size is 2 and the dropout rate is 0.3. Following the three blocks, we chose to stack 3 LSTM cells with 180 outputs each. Finally, a fully connected layer of 128 nodes is added before the last classification layer.
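As a rough illustration, a minimal Keras sketch of this architecture could look like the snippet below; the vocabulary size, sequence length, embedding dimension and number of emotion classes are placeholder assumptions, not values taken from the repository.

```python
from tensorflow.keras import layers, models

def build_text_model(vocab_size=20000, seq_len=300, embed_dim=300, n_classes=7):
    inputs = layers.Input(shape=(seq_len,))
    x = layers.Embedding(vocab_size, embed_dim)(inputs)
    # Three Conv1D blocks: 128, 256 and 512 filters, kernel size 8,
    # each followed by max pooling (size 2), spatial dropout (0.3) and batch normalization
    for filters in (128, 256, 512):
        x = layers.Conv1D(filters, kernel_size=8, padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
        x = layers.SpatialDropout1D(0.3)(x)
        x = layers.BatchNormalization()(x)
    # Three stacked LSTM cells with 180 outputs each
    x = layers.LSTM(180, return_sequences=True)(x)
    x = layers.LSTM(180, return_sequences=True)(x)
    x = layers.LSTM(180)(x)
    # Fully connected layer of 128 nodes before the final classification layer
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```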


b. Audio Analysis


Pipeline

The speech emotion recognition pipeline was built the following way:

Model

The model we have chosen is a Time Distributed Convolutional Neural Network.

The main idea of a Time Distributed Convolutional Neural Network is to apply a rolling window (of fixed size and time step) along the log-mel-spectrogram. Each of these windows is the input of a convolutional neural network composed of four Local Feature Learning Blocks (LFLBs), and the output of each of these convolutional networks is fed into a recurrent neural network composed of two LSTM (Long Short-Term Memory) cells to learn the long-term contextual dependencies. Finally, a fully connected layer with softmax activation is used to predict the emotion detected in the voice.
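To make the structure concrete, here is a hedged Keras sketch of such a time-distributed model; the number of windows, window dimensions, filter counts and number of classes are assumptions chosen for illustration rather than values from the repository.

```python
from tensorflow.keras import layers, models

def build_audio_model(n_windows=7, n_mels=128, frames_per_window=64, n_classes=7):
    # Input: a sequence of rolling windows taken from the log-mel-spectrogram
    inputs = layers.Input(shape=(n_windows, n_mels, frames_per_window, 1))
    x = inputs
    # Four Local Feature Learning Blocks (convolution + batch norm + activation + pooling),
    # applied identically to every window through TimeDistributed
    for filters in (64, 64, 128, 128):
        x = layers.TimeDistributed(layers.Conv2D(filters, (3, 3), padding="same"))(x)
        x = layers.TimeDistributed(layers.BatchNormalization())(x)
        x = layers.TimeDistributed(layers.Activation("elu"))(x)
        x = layers.TimeDistributed(layers.MaxPooling2D(pool_size=(2, 2)))(x)
    x = layers.TimeDistributed(layers.Flatten())(x)
    # Two stacked LSTM cells learn the long-term contextual dependencies across windows
    x = layers.LSTM(256, return_sequences=True)(x)
    x = layers.LSTM(256)(x)
    # Fully connected layer with softmax activation predicts the emotion
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```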


To limit overfitting, we tuned the model with:

c. Video Analysis


Pipeline

The video processing pipeline was built the following way:

Model

The model we have chosen is an Xception model, since it outperformed the other approaches we had developed so far. We tuned the model with:

As you might have understood, the aim was to limit overfitting as much as possible in order to obtain a robust model.


The Xception architecture is based on depthwise separable convolutions, which involve far fewer trainable parameters and therefore reduce training time on Colab's GPUs to less than 90 minutes.
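For illustration, the snippet below sketches the kind of depthwise separable convolution block that Xception stacks; the input shape, filter counts and class count are assumptions, and the real Xception model is considerably deeper.

```python
from tensorflow.keras import layers, models

def separable_block(x, filters):
    # Depthwise convolution (one filter per input channel) followed by a 1x1 pointwise
    # convolution: far fewer parameters than a standard Conv2D with the same filter count
    x = layers.SeparableConv2D(filters, (3, 3), padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

# Toy usage on 48x48 grayscale face crops (input shape and 7 emotion classes are assumptions)
inputs = layers.Input(shape=(48, 48, 1))
x = separable_block(inputs, 64)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)
x = separable_block(x, 128)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(7, activation="softmax")(x)
model = models.Model(inputs, outputs)
```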


When it comes to applying CNNs in real-life applications, being able to explain the results is a great challenge. We can indeed plot class activation maps, which display the pixels that were activated by the last convolution layer. We notice how the pixels are activated differently depending on the emotion being labeled. Happiness seems to depend on the pixels linked to the eyes and mouth, whereas sadness or anger, for example, seem to be more related to the eyebrows.
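As a sketch of how such a map can be computed (a Grad-CAM-style approach; the layer name "last_conv" and the model variable are placeholders, not names from the repository):

```python
# Grad-CAM-style sketch: weight the last convolution layer's feature maps by the
# gradient of the predicted class score, then rescale the result to [0, 1].
import numpy as np
import tensorflow as tf

def class_activation_map(model, image, last_conv_name="last_conv"):
    grad_model = tf.keras.models.Model(
        model.inputs, [model.get_layer(last_conv_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_idx = tf.argmax(preds[0])
        class_score = tf.gather(preds, class_idx, axis=1)
    grads = tape.gradient(class_score, conv_out)          # how each feature map pixel affects the score
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))       # one importance weight per channel
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)   # weighted sum of the feature maps
    cam = tf.nn.relu(cam)                                 # keep only positive activations
    cam = cam / (tf.reduce_max(cam) + 1e-8)               # scale to [0, 1] for plotting
    return cam.numpy()
```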


d. Ensemble Model

The ensemble model has not been implemented in this version.


V. How to use it?

There are several resources available:

To use the web app, install the requirements from the WebApp repository and launch main.py.

VI. Research Paper

If you are interested in the research paper we are currently working on, feel free to check out this link: https://www.overleaf.com/read/xvtrrfpvzwhf

VII. Contributors

Anatoli-deBRADKE 💻
maelfabien 💻
RaphaelLederman 💻
STF-R 💻