HAMZA310 opened this issue 3 years ago
Hi @HAMZA310
Thanks for the suggestion. The proposal sounds good to me. One question regarding the multimodal part: how do you propose to handle video in torchaudio? torchvision has an ffmpeg binding that, I believe, can handle both audio and video, but torchaudio does not have such a capability at the moment. So I am wondering whether this might be a better fit for torchvision.
cc @fmassa @datumbox @NicolasHug
Hi @mthrok
Thanks for your response. About the multimodal part, I'm proposing to add only the audio stream to torchaudio, similar to this release.
The audio stream in CREMA-D is not dependent on the visual recordings by any means and could be considered a standalone audio dataset.
🚀 Feature
CREMA-D is an audio-visual dataset for emotion recognition, with actors from a variety of races and ethnicities.
The dataset consists of facial and vocal emotional expressions in sentences spoken in a range of basic emotional states (happy, sad, anger, fear, disgust, and neutral).
It contains 7,442 clips from 91 actors (48 male and 43 female).
torchaudio should contain the audio stream from the original audio-visual recording.
Motivation
Pitch
I have recently used this dataset with PyTorch in one of my projects. I'd be happy to open a pull request, making sure the implementation follows the recent template changes.
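For context, a dataset implementation would recover the emotion label and actor metadata from the clip filenames, which in CREMA-D follow an `ActorID_SentenceID_Emotion_Intensity.wav` convention. A minimal sketch of such a parser is below; the helper name and the emotion-code mapping are illustrative assumptions, not part of any existing torchaudio API:

```python
from pathlib import Path

# Assumed mapping from CREMA-D emotion codes to label strings
# (illustrative; the actual dataset class would define its own labels).
_EMOTIONS = {
    "ANG": "anger",
    "DIS": "disgust",
    "FEA": "fear",
    "HAP": "happy",
    "NEU": "neutral",
    "SAD": "sad",
}


def parse_cremad_filename(filename):
    """Split a CREMA-D style filename into (actor_id, sentence, emotion, intensity).

    Example filename: "1001_DFA_ANG_XX.wav"
    """
    stem = Path(filename).stem            # drop the ".wav" extension
    actor, sentence, emotion, intensity = stem.split("_")
    return int(actor), sentence, _EMOTIONS.get(emotion, emotion), intensity
```

A `Dataset.__getitem__` would then pair this metadata with the waveform loaded via `torchaudio.load`, mirroring the structure of the other torchaudio datasets.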
Additional context
This is the reference paper for CREMA-D.