jedyang97 / MTAG

Code for NAACL 2021 paper: MTAG: Modal-Temporal Attention Graph for Unaligned Human Multimodal Language Sequences
MIT License

Question about mosi datasets' preprocessing #5

Closed sunjieemm closed 1 year ago

sunjieemm commented 1 year ago

Hi, I am confused about the MOSI dataset's preprocessing. I found two versions of the MOSI dataset. For the text, audio, and vision modalities, one version has feature dimensions of [300, 5, 20] respectively, while the other has [300, 74, 47], whether aligned or unaligned. As far as I know, the MOSI audio and vision features extracted by COVAREP and Facet have 74 and 47 dimensions. So was a different preprocessing used? Looking forward to hearing from you!

jedyang97 commented 1 year ago

Hi @sunjieemm, thanks for your interest in MTAG!

We were using data pre-processed by the Tensor Fusion paper. I took a quick look, and on page 4 of that paper the authors briefly describe the pre-processing:

A set of 20 Facial Action Units (Ekman et al., 1980), indicating detailed muscle movements on the face, are also extracted using FACET

This is likely what the 20 in the visual modality refers to.

As for the acoustic modality, the authors also have a detailed paragraph explaining how it was extracted with COVAREP. That paragraph is a bit long, so I will point you to the original paper for the details.
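As a quick sanity check, you can tell the two versions apart just by looking at the last (feature) dimension of each modality array. The helper below is a hypothetical sketch (not part of the MTAG codebase), using dummy NumPy arrays with an arbitrary sequence length:

```python
import numpy as np

def identify_mosi_version(text, audio, vision):
    """Hypothetical helper: guess which MOSI feature version a
    (text, audio, vision) triple comes from, based on the feature
    dimension (last axis) of each modality."""
    dims = (text.shape[-1], audio.shape[-1], vision.shape[-1])
    if dims == (300, 5, 20):
        # Tensor Fusion pre-processing: GloVe text (300-d),
        # reduced COVAREP set (5-d), 20 FACET Action Units.
        return "tensor-fusion"
    if dims == (300, 74, 47):
        # Full COVAREP (74-d) and Facet (47-d) feature sets.
        return "full-covarep-facet"
    return "unknown"

# Dummy arrays with 50 time steps (sequence length is arbitrary here).
t = np.zeros((50, 300))
a = np.zeros((50, 5))
v = np.zeros((50, 20))
print(identify_mosi_version(t, a, v))  # → tensor-fusion
```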

Hope this helps!