Don't hesitate to ⭐ the repo if you enjoy our work!
We developed a multimodal emotion recognition platform to analyze the emotions of job candidates, in partnership with the French Employment Agency.
We analyze facial, vocal, and textual emotions, using mostly deep learning-based approaches. We deployed a web app using Flask:
The tool can be accessed from the WebApp repository by installing the requirements and launching `main.py`.
We have also written a paper on our work: https://www.overleaf.com/read/xvtrrfpvzwhf
In this project, we are exploring state-of-the-art models in multimodal sentiment analysis. We have chosen to explore text, sound, and video inputs, and to develop an ensemble model that gathers the information from all these sources and displays it in a clear and interpretable way.
Affective computing is a field of Machine Learning and Computer Science that studies the recognition and processing of human affects. Multimodal Emotion Recognition is a relatively new discipline that aims to include text inputs, as well as sound and video. This field has been rising with the development of social networks, which gave researchers access to a vast amount of data.
We have chosen to diversify the data sources we used depending on the type of data considered. All data sets used are free of charge and can be directly downloaded.
| Modality | Data | Processed Data (for training) | Pre-trained Model | Colab Notebook | Other |
|---|---|---|---|---|---|
| Text | here | X-train, y-train, X-test, y-test | Weights, Model | --- | --- |
| Audio | here | X-train, y-train, X-test, y-test | Weights, Model | Colab Notebook | --- |
| Video | here | X-train, y-train, X-test, y-test | Weights, Model | Colab Notebook | Face Detect Model |
Our aim is to develop a model able to provide live sentiment analysis with a visual user interface. Therefore, we have decided to separate two types of inputs:
The text-based personality recognition pipeline has the following structure:
We have chosen a neural network architecture based on both one-dimensional convolutional neural networks and recurrent neural networks. The one-dimensional convolution layer plays a role comparable to feature extraction: it allows finding patterns in the text data. The Long Short-Term Memory cell is then used to leverage the sequential nature of natural language: unlike regular neural networks, where inputs are assumed to be independent of each other, these architectures progressively accumulate and capture information through the sequence. LSTMs have the property of selectively remembering patterns for long durations of time. Our final model first includes 3 consecutive blocks consisting of the following four layers: one-dimensional convolution layer, max pooling, spatial dropout, and batch normalization. The numbers of convolution filters are 128, 256, and 512 for the three blocks respectively, the kernel size is 8, the max pooling size is 2, and the dropout rate is 0.3. Following the three blocks, we stack 3 LSTM cells with 180 outputs each. Finally, a fully connected layer of 128 nodes is added before the last classification layer.
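As a reference, here is a minimal Keras sketch of the architecture described above. The vocabulary size, sequence length, embedding dimension, and number of output classes are assumptions for illustration, not the exact values used in this repository.

```python
# Minimal sketch of the text model: 3 Conv1D blocks (128/256/512 filters,
# kernel 8, max pooling 2, spatial dropout 0.3, batch norm), 3 stacked
# LSTMs with 180 units, a 128-node dense layer, then the classifier.
from tensorflow.keras import layers, models

def build_text_model(vocab_size=20000, seq_len=300, embed_dim=300, n_classes=5):
    inputs = layers.Input(shape=(seq_len,))
    x = layers.Embedding(vocab_size, embed_dim)(inputs)

    # Three convolutional blocks: Conv1D -> MaxPooling -> SpatialDropout -> BatchNorm
    for filters in (128, 256, 512):
        x = layers.Conv1D(filters, kernel_size=8, activation="relu", padding="same")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
        x = layers.SpatialDropout1D(0.3)(x)
        x = layers.BatchNormalization()(x)

    # Three stacked LSTM cells with 180 outputs each
    x = layers.LSTM(180, return_sequences=True)(x)
    x = layers.LSTM(180, return_sequences=True)(x)
    x = layers.LSTM(180)(x)

    # Fully connected layer before the final classification layer
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```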
The speech emotion recognition pipeline was built the following way:
The model we have chosen is a Time Distributed Convolutional Neural Network.
The main idea of a Time Distributed Convolutional Neural Network is to apply a rolling window (fixed size and time-step) along the log-mel-spectrogram. Each of these windows is the input of a convolutional neural network composed of four Local Feature Learning Blocks (LFLBs), and the output of each of these convolutional networks is fed into a recurrent neural network composed of 2 LSTM (Long Short-Term Memory) cells to learn the long-term contextual dependencies. Finally, a fully connected layer with softmax activation is used to predict the emotion detected in the voice.
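The sketch below illustrates this time-distributed structure in Keras. The number of windows, the spectrogram dimensions, the filter counts, and the number of emotion classes are assumptions chosen for illustration.

```python
# Sketch of the Time Distributed CNN: the same 4-block CNN (LFLBs) is applied
# to each log-mel-spectrogram window, and the per-window features are passed
# through 2 LSTM cells before the softmax classifier.
from tensorflow.keras import layers, models

def build_speech_model(n_windows=7, n_mels=128, n_frames=128, n_classes=7):
    # Input: a sequence of log-mel-spectrogram windows (rolling window over time)
    inputs = layers.Input(shape=(n_windows, n_mels, n_frames, 1))
    x = inputs

    # Four Local Feature Learning Blocks, applied identically to every window
    for filters in (64, 64, 128, 128):
        x = layers.TimeDistributed(layers.Conv2D(filters, (3, 3), padding="same"))(x)
        x = layers.TimeDistributed(layers.BatchNormalization())(x)
        x = layers.TimeDistributed(layers.Activation("elu"))(x)
        x = layers.TimeDistributed(layers.MaxPooling2D((2, 2)))(x)

    # Flatten each window's feature map before the recurrent part
    x = layers.TimeDistributed(layers.Flatten())(x)

    # Two LSTM cells to learn long-term contextual dependencies across windows
    x = layers.LSTM(256, return_sequences=True)(x)
    x = layers.LSTM(256)(x)

    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```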
To limit overfitting, we tuned the model with:
The video processing pipeline was built the following way:
The model we have chosen is an XCeption model, since it outperformed the other approaches we had developed so far. We tuned the model with:
As you might have understood, the aim was to limit overfitting as much as possible in order to obtain a robust model.
The XCeption architecture is based on DepthWise Separable convolutions, which require far fewer parameters to train and therefore reduce training time on Colab's GPUs to less than 90 minutes.
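To see the parameter saving concretely, the snippet below (an illustration, not the project's training script) compares a standard convolution with a depthwise separable one on the same input, and loads a Keras Xception backbone. The input sizes are assumptions.

```python
# Why depthwise separable convolutions are cheaper: compare parameter counts
# of Conv2D vs SeparableConv2D on identical inputs, then build an Xception backbone.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import Xception

inp = layers.Input(shape=(48, 48, 3))
standard = models.Model(inp, layers.Conv2D(64, (3, 3))(inp))
separable = models.Model(inp, layers.SeparableConv2D(64, (3, 3))(inp))
print(standard.count_params(), "vs", separable.count_params())  # separable uses far fewer

# Xception backbone (71x71 is the minimum input size accepted by Keras; the
# actual face-crop size used in this project may differ)
backbone = Xception(weights=None, include_top=False, input_shape=(71, 71, 3))
```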
When it comes to applying CNNs in real-life applications, being able to explain the results is a great challenge. We can indeed plot class activation maps, which display the pixels that have been activated by the last convolution layer. We notice how the pixels are activated differently depending on the emotion being labeled. Happiness seems to depend on the pixels linked to the eyes and mouth, whereas sadness or anger seem, for example, to be more related to the eyebrows.
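One common way to compute such maps is a Grad-CAM-style weighting of the last convolution layer's feature maps. The sketch below is a hedged illustration, not the exact code used in this repository: the layer name and preprocessing are assumptions to adapt to the trained weights.

```python
# Hedged Grad-CAM-style sketch: weight the last conv layer's feature maps by the
# gradient of the predicted emotion, then produce a normalized heatmap.
import numpy as np
import tensorflow as tf

def class_activation_map(model, image, last_conv_layer="block14_sepconv2_act"):
    """Return a heatmap (H x W, values in [0, 1]) for the predicted emotion."""
    grad_model = tf.keras.models.Model(
        model.inputs, [model.get_layer(last_conv_layer).output, model.output]
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        top_class = int(tf.argmax(preds[0]))
        class_channel = preds[:, top_class]
    grads = tape.gradient(class_channel, conv_out)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))        # one weight per filter
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)    # weighted sum of feature maps
    cam = tf.nn.relu(cam)
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```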
The ensemble model has not been implemented in this version.
There are several resources available:
To use the web app, run `python app.py`.
If you are interested in the research paper we are currently working on, feel free to check out this link: https://www.overleaf.com/read/xvtrrfpvzwhf
- Anatoli-deBRADKE 💻
- maelfabien 💻
- RaphaelLederman 💻
- STF-R 💻