thuiar / MMSA-FET

A Tool for extracting multimodal features from videos.
GNU General Public License v3.0

MMSA-Feature Extraction Toolkit

MMSA-Feature Extraction Toolkit extracts multimodal features for Multimodal Sentiment Analysis datasets. It integrates several commonly used tools for the visual, acoustic and text modalities. The extracted features are compatible with the MMSA Framework and can be used with it directly. The tool can also extract features from single videos.

This work is included in the ACL 2022 demo paper "M-SENA: An Integrated Platform for Multimodal Sentiment Analysis". If you find our work useful, please cite our paper. Thank you!

@article{mao2022m,
  title={M-SENA: An Integrated Platform for Multimodal Sentiment Analysis},
  author={Mao, Huisheng and Yuan, Ziqi and Xu, Hua and Yu, Wenmeng and Liu, Yihe and Gao, Kai},
  journal={arXiv preprint arXiv:2203.12441},
  year={2022}
}

Features

1. Installation

MMSA-Feature Extraction Toolkit is available from PyPI. Because of the package size limit on PyPI, large model files cannot be shipped with the package, so users need to run a post-install command to download them. If you can't access Google Drive, please refer to this page for manual download.

# Install package from PyPI
$ pip install MMSA-FET
# Download models & libraries from Google Drive. Use --proxy if needed.
$ python -m MSA_FET install

Note: A few system-wide dependencies need to be installed manually. See Dependency Installation for more information.

2. Quick Start

MMSA-FET is fairly easy to use. You can either call the API in Python or use the command-line interface. Below is a basic example using the Python API.

Note: To extract features for a dataset, the dataset must be organized in a specific file structure and must include a label.csv file. See Dataset and Structure for details. Raw video files and label files for MOSI, MOSEI and CH-SIMS can be downloaded from BaiduYunDisk (code: mfet) or Google Drive.

from MSA_FET import FeatureExtractionTool
from MSA_FET import run_dataset

# initialize with default librosa config which only extracts audio features
fet = FeatureExtractionTool("librosa")

# alternatively initialize with a custom config file
fet = FeatureExtractionTool("custom_config.json")

# extract features for single video
feature1 = fet.run_single("input1.mp4")
print(feature1)
feature2 = fet.run_single("input2.mp4")

# extract for dataset & save features to file
run_dataset(
    config="aligned",
    dataset_dir="~/MOSI",
    out_file="output/feature.pkl",
    num_workers=4
)

Here, custom_config.json is the path to a custom config file; its format is described in the Config File section below.

For detailed usage, please read APIs and Command Line Arguments.
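Once run_dataset has written its output, the file can be inspected before it is passed to a training pipeline. The sketch below assumes that output/feature.pkl is a standard Python pickle, as the .pkl extension suggests. The layout of the stored object is not described here, so the helpers only report whatever keys, types and shapes they find. The names load_features, describe and summarize are illustrative and are not part of MSA_FET.

```python
import pickle
from pathlib import Path


def describe(value):
    """Describe one value by type, plus shape or length when available."""
    shape = getattr(value, "shape", None)  # e.g. NumPy arrays or tensors
    if shape is not None:
        return f"{type(value).__name__} with shape {tuple(shape)}"
    if isinstance(value, (dict, list, tuple, str)):
        return f"{type(value).__name__} of length {len(value)}"
    return type(value).__name__


def summarize(obj, depth=0, max_depth=3):
    """Return indented lines describing a nested dict, down to max_depth."""
    lines = []
    if isinstance(obj, dict) and depth < max_depth:
        for key, value in obj.items():
            lines.append(f"{'  ' * depth}{key}: {describe(value)}")
            if isinstance(value, dict):
                lines.extend(summarize(value, depth + 1, max_depth))
    return lines


def load_features(path):
    """Load a pickled feature file. Only unpickle files you trust."""
    with Path(path).expanduser().open("rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # Assumed path: the out_file value from the run_dataset example.
    feature_file = Path("output/feature.pkl")
    if feature_file.exists():
        print("\n".join(summarize(load_features(feature_file))))
    else:
        print(f"{feature_file} not found; run run_dataset first.")
```

Printing a summary first is a cheap way to confirm which modalities were actually extracted before any model code depends on them.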

3. Config File

MMSA-FET comes with several example configs, which can be used as shown below.

# Each supported tool has an example config
fet = FeatureExtractionTool(config="aligned")
fet = FeatureExtractionTool(config="librosa")
fet = FeatureExtractionTool(config="opensmile")
fet = FeatureExtractionTool(config="wav2vec")
fet = FeatureExtractionTool(config="openface")
fet = FeatureExtractionTool(config="mediapipe")
fet = FeatureExtractionTool(config="bert")
fet = FeatureExtractionTool(config="roberta")

For customized features, you can:

  1. Edit the default configs and pass a dictionary to the config parameter, as in the example below:
from MSA_FET import FeatureExtractionTool, get_default_config

# here we only extract audio and video features
config_a = get_default_config('opensmile')
config_v = get_default_config('openface')

# modify default config
config_a['audio']['args']['feature_level'] = 'LowLevelDescriptors'

# combine audio and video configs
config = {**config_a, **config_v}

# initialize
fet = FeatureExtractionTool(config=config)
  2. Provide a JSON config file. The example below extracts features for all three modalities. To extract unimodal features, remove the unneeded sections from the file.
{
  "audio": {
    "tool": "librosa",
    "sample_rate": null,
    "args": {
      "mfcc": {
        "n_mfcc": 20,
        "htk": true
      },
      "rms": {},
      "zero_crossing_rate": {},
      "spectral_rolloff": {},
      "spectral_centroid": {}
    }
  },
  "video": {
    "tool": "openface",
    "fps": 25,
    "average_over": 3,
    "args": {
      "hogalign": false,
      "simalign": false,
      "nobadaligned": false,
      "landmark_2D": true,
      "landmark_3D": false,
      "pdmparams": false,
      "head_pose": true,
      "action_units": true,
      "gaze": true,
      "tracked": false
    }
  },
  "text": {
    "model": "bert",
    "device": "cpu",
    "pretrained": "models/bert_base_uncased",
    "args": {}
  }
}
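Both options above come down to composing top-level modality sections ("audio", "video", "text"). The following is a minimal sketch of that composition in plain Python, using only the standard library. The helpers merge_modalities, keep_modalities and save_config are hypothetical names and are not part of MSA_FET. The sample dicts are abbreviated stand-ins shaped like the JSON example, not complete configs.

```python
import json
from pathlib import Path


def merge_modalities(*configs):
    """Merge per-modality config dicts, refusing to overwrite a modality."""
    merged = {}
    for config in configs:
        for modality, section in config.items():
            if modality in merged:
                raise ValueError(f"modality {modality!r} defined more than once")
            merged[modality] = section
    return merged


def keep_modalities(config, names):
    """Return a new dict containing only the listed top-level sections."""
    missing = [name for name in names if name not in config]
    if missing:
        raise KeyError(f"config has no section(s): {missing}")
    return {name: config[name] for name in names}


def save_config(config, path):
    """Write config as JSON so the file path can be used as a config file."""
    path = Path(path)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


# Abbreviated stand-ins shaped like the sections of the JSON example.
audio_cfg = {"audio": {"tool": "librosa", "sample_rate": None, "args": {"rms": {}}}}
text_cfg = {
    "text": {
        "model": "bert",
        "device": "cpu",
        "pretrained": "models/bert_base_uncased",
        "args": {},
    }
}

full = merge_modalities(audio_cfg, text_cfg)   # sections: audio, text
audio_only = keep_modalities(full, ["audio"])  # unimodal config
# save_config(audio_only, "custom_config.json") would write it to disk.
```

Unlike a plain {**a, **b} merge, which silently lets the later dict win, merge_modalities raises an error when two configs define the same modality, which makes accidental overwrites visible.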

4. Supported Tools & Features

4.1 Audio Tools

4.2 Video Tools

4.3 Text Tools

4.4 Aligners