
Guided Interpretable Facial Expression Recognition via Spatial Action Unit Cues (FG2024)

by Soufiane Belharbi1, Marco Pedersoli1, Alessandro Lameiras Koerich1, Simon Bacon2, Eric Granger1

1 LIVIA, Dept. of Systems Engineering, ÉTS, Montreal, Canada
2 Dept. of Health, Kinesiology & Applied Physiology, Concordia University, Montreal, Canada

[arXiv] [Hugging Face Spaces]

(method overview figure)

[Spotlight] [Poster]

Abstract

Although state-of-the-art classifiers for facial expression recognition (FER) can achieve a high level of accuracy, they lack interpretability, an important feature for end-users. Experts typically associate spatial action units (AUs) from a codebook with facial regions for the visual interpretation of expressions. In this paper, the same expert steps are followed. A new learning strategy is proposed to explicitly incorporate AU cues into classifier training, making it possible to train deep interpretable models. During training, this AU codebook is used, along with the input image's expression label and facial landmarks, to construct an AU heatmap that indicates the most discriminative image regions of interest w.r.t. the facial expression. This valuable spatial cue is leveraged to train a deep interpretable classifier for FER. This is achieved by constraining the spatial layer features of a classifier to be correlated with AU heatmaps. Using a composite loss, the classifier is trained to correctly classify an image while yielding interpretable visual layer-wise attention correlated with AU maps, simulating the expert decision process. Our strategy relies only on image-level expression labels for supervision, without additional manual annotations. It is generic and can be applied to any deep CNN- or transformer-based classifier without requiring any architectural change or significant additional training time. Our extensive evaluation on two public benchmarks, RAF-DB and AffectNet, shows that the proposed strategy can improve layer-wise interpretability without degrading classification performance. In addition, we explore a common type of interpretable classifier that relies on class activation mapping (CAM) methods, and show that our approach can also improve CAM interpretability.
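To make the composite loss idea concrete, below is a minimal sketch, not the repository's actual code: a cross-entropy classification term plus an alignment term that correlates a spatial attention map, pooled from a layer's features, with the precomputed AU heatmap. The function name, the channel-mean pooling, the cosine-similarity alignment, and align_weight are illustrative assumptions.

import torch
import torch.nn.functional as F

def composite_loss(logits, labels, feature_maps, au_heatmaps, align_weight=1.0):
    # logits: (B, C) class scores; labels: (B,) expression labels.
    # feature_maps: (B, K, h, w) spatial features of a chosen layer.
    # au_heatmaps: (B, H, W) AU heatmaps built from landmarks + the AU codebook.
    cls_loss = F.cross_entropy(logits, labels)      # classification term
    attention = feature_maps.mean(dim=1)            # (B, h, w) spatial attention
    # Resize the AU heatmap to the attention map's spatial size.
    target = F.interpolate(au_heatmaps.unsqueeze(1), size=attention.shape[-2:],
                           mode='bilinear', align_corners=False).squeeze(1)
    # Alignment term: push the per-sample cosine similarity between the
    # attention map and the AU heatmap toward 1.
    align_loss = 1.0 - F.cosine_similarity(attention.flatten(1), target.flatten(1), dim=1).mean()
    return cls_loss + align_weight * align_loss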

Code: PyTorch 2.0.0

Citation:

@InProceedings{belharbi24-fer-aus,
  title={Guided Interpretable Facial Expression Recognition via Spatial Action Unit Cues},
  author={Belharbi, S. and Pedersoli, M. and Koerich, A. L. and Bacon, S. and Granger, E.},
  booktitle={International Conference on Automatic Face and Gesture Recognition},
  year={2024}
}


Install:

# Create a virtual environment with conda
./create_env_conda.sh NAME_OF_THE_VIRTUAL_ENV

Download datasets:

Download the RAF-DB and AffectNet datasets.

Once you have downloaded the datasets, adjust the dataset paths in get_root_wsol_dataset() (a sketch is given below).
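As an illustration only (the actual function lives in the repository's code and may differ), get_root_wsol_dataset() is expected to return the folder that holds the downloaded datasets; the path below is a placeholder.

def get_root_wsol_dataset():
    # Placeholder: parent directory that contains the RAF-DB and AffectNet folders.
    return "/absolute/path/to/your/datasets"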

Data preparation:

Run code:

Pretrained weights (evaluation):

We provide the weights for all the models (44 weights: 2 datasets (RAF-DB, AffectNet) x 11 methods x 2 [with/without AUs]). The weights can be found on Hugging Face in the file shared-trained-models.tar.gz. To run a single case:

python eval.py --cudaid 0 --split test --checkpoint_type best --exp_path $rootdir/shared-trained-models/FG_FER/AffectNet/resnet50/STD_CL/CAM/align_atten_to_heatmap_True/AffectNet-resnet50-CAM-WGAP-cp_best-boxv2_False

To run all 44 cases:

./eval_all.sh 0

To evaluate a single image only, you can use:

python single_img_eval.py --cudaid 0 --checkpoint_type best --exp_path $rootdir/shared-trained-models/FG_FER/RAF-DB/resnet50/STD_CL/CAM/align_atten_to_heatmap_True/RAF-DB-resnet50-CAM-WGAP-cp_best-boxv2_False
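For reference, a generic PyTorch single-image inference pattern is sketched below. This is not the logic of single_img_eval.py (which also handles model construction, checkpoint loading, and attention/CAM visualization); the input size and normalization are assumptions.

import torch
from PIL import Image
from torchvision import transforms

# Assumed preprocessing: 224x224 input with ImageNet normalization.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def predict_expression(model: torch.nn.Module, image_path: str) -> int:
    # Returns the index of the predicted expression class for one face image.
    model.eval()
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        logits = model(x)
    return logits.argmax(dim=1).item()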

The provided weights can be used to reproduce the classification and localization performance reported in the paper, summarized in the following table:

(classification and localization results table)

We also provide the folds and the facial landmarks on Hugging Face in the file folds.tar.gz. For the RAF-DB dataset, you need to crop and align the images using this code (see above in the README) so that the facial landmarks match. For AffectNet, you can use the provided version of the dataset.

Decompress both files into the root of this repository.