Questions about the video encodings of the mosi and mosei datasets

pliang279 / MultiBench

[NeurIPS 2021] Multiscale Benchmarks for Multimodal Representation Learning

MIT License

478 stars 68 forks source link

Thank you for writing a brilliant paper and a convenient code repository to reproduce the results. I have gone through the repo and the paper, but I still have questions about the implemented datasets and dataloaders.

Could you please lend some time to elucidate the following questions about the datasets?

For MOSEI dataset, the encodings for a datapoint are of size 713. I can understand that these features are obtained from OpenFace and Facet libraries, but could you tell us which component/indices in the encodings are obtained from where ?
For the MOSI dataset, the encodings are only of size 35. It seems there are only the Facet features provided for the dataset. Is there a reason why other (OpenFace) features are not used/provided as in Mosei ?
Are you fine-tuning the training data of MOSI/MOSEI to obtain the video encodings?

Thank you again for your efforts. Your answers would save many hours banging our heads around the code.