jedyang97 / MTAG

Code for NAACL 2021 paper: MTAG: Modal-Temporal Attention Graph for Unaligned Human Multimodal Language Sequences
MIT License

Question about mosi datasets' preprocessing #5

Closed sunjieemm closed 1 year ago

sunjieemm commented 1 year ago

Hi, I am confused about the MOSI dataset's preprocessing. I found two versions of the MOSI dataset. For the text, audio, and vision modalities, one version has feature dimensions of [300, 5, 20] respectively, while the other has [300, 74, 47], whether aligned or unaligned. As far as I know, the MOSI audio and vision features extracted by COVAREP and Facet have 74 and 47 dimensions. So was a different preprocessing used? Looking forward to hearing from you!

jedyang97 commented 1 year ago

Hi @sunjieemm, thanks for your interest in MTAG!

We were using data pre-processed by the Tensor Fusion paper. I took a quick look, and on page 4 of that paper the authors briefly describe the pre-processing:

A set of 20 Facial Action Units (Ekman et al., 1980), indicating detailed muscle movements on the face, are also extracted using FACET

This is likely what the 20 in the visual modality refers to.

As for the acoustic modality, the authors also have a detailed paragraph explaining how it was extracted with COVAREP. That paragraph is a bit long, so I will point you to the original paper for the details.
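As a quick sanity check, you can tell the two versions apart just by looking at the last (feature) dimension of each modality array. The helper below is a hypothetical sketch (not part of the MTAG codebase), using dummy NumPy arrays with an arbitrary sequence length:

```python
import numpy as np

def identify_mosi_version(text, audio, vision):
    """Hypothetical helper: guess which MOSI feature version a
    (text, audio, vision) triple comes from, based on the feature
    dimension (last axis) of each modality."""
    dims = (text.shape[-1], audio.shape[-1], vision.shape[-1])
    if dims == (300, 5, 20):
        # Tensor Fusion pre-processing: GloVe text (300-d),
        # reduced COVAREP set (5-d), 20 FACET Action Units.
        return "tensor-fusion"
    if dims == (300, 74, 47):
        # Full COVAREP (74-d) and Facet (47-d) feature sets.
        return "full-covarep-facet"
    return "unknown"

# Dummy arrays with 50 time steps (sequence length is arbitrary here).
t = np.zeros((50, 300))
a = np.zeros((50, 5))
v = np.zeros((50, 20))
print(identify_mosi_version(t, a, v))  # → tensor-fusion
```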

Hope this helps!