pliang279 / MultiBench

[NeurIPS 2021] Multiscale Benchmarks for Multimodal Representation Learning
MIT License

Questions about the video encodings of the MOSI and MOSEI datasets #23

Closed: srb-cv closed this issue 2 years ago

srb-cv commented 2 years ago

Thank you for writing a brilliant paper and a convenient code repository to reproduce the results. I have gone through the repo and the paper, but I still have questions about the implemented datasets and dataloaders.

Could you please take some time to clarify the following questions about the datasets?

  1. For the MOSEI dataset, the encodings for a data point are of size 713. I understand that these features are obtained from the OpenFace and Facet libraries, but could you tell us which components/indices in the encodings come from which library? (See the snippet after this list for how we inspected the sizes.)

  2. For the MOSI dataset, the encodings are only of size 35. It seems that only the Facet features are provided. Is there a reason why the other (OpenFace) features are not used/provided, as they are for MOSEI?

  3. Are you fine-tuning on the training data of MOSI/MOSEI to obtain the video encodings?
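For reference, this is the quick check we ran to get the sizes quoted above. It is a minimal sketch that assumes the processed pickle files (`mosi_raw.pkl`, `mosei_raw.pkl`) contain `train`/`valid`/`test` splits, each with `vision`/`audio`/`text`/`labels` arrays; we inferred that layout from the `datasets/affect` dataloader code, so the keys may need adjusting:

```python
import pickle

import numpy as np

# Sketch only: the split/field names below are assumptions based on the
# datasets/affect dataloader code in MultiBench, not an official spec.
for name in ["mosi_raw.pkl", "mosei_raw.pkl"]:
    with open(name, "rb") as f:
        data = pickle.load(f)
    vision = np.asarray(data["train"]["vision"])
    # Expected shape: (num_examples, seq_len, feature_dim), where
    # feature_dim is 35 for MOSI (Facet) and 713 for MOSEI.
    print(name, "vision features:", vision.shape)
```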

Thank you again for your efforts. Your answers would save us many hours of banging our heads against the code.

Vanvan2017 commented 2 years ago

Hey, thanks for your interest in the video data.

  1. For the encodings, you can refer to the OpenFace documentation; there are also tutorials and explanations of what each feature group means. (See the sketch after this list for one way to inspect the output columns.)

  2. For experiments, the Facet features (size 35) are the most commonly used (see previous work); you can find all of the MOSI features here.

  3. I think all of the features were simply extracted with toolkits such as Facet or OpenFace, i.e., without fine-tuning on MOSI/MOSEI. I know these are out-of-date compared to today's fine-tuned methods, so you are welcome to apply a state-of-the-art method to the raw video!
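To make item 1 concrete, here is a small sketch of how one could group the columns of an OpenFace `FeatureExtraction` CSV into components (action units, gaze, head pose). The file path is illustrative, and the column-name patterns follow the output format documented on the OpenFace wiki, so verify them against your OpenFace version:

```python
import pandas as pd

# Sketch only: assumes you ran OpenFace's FeatureExtraction binary on a
# video, producing a CSV such as "processed/video.csv". Column names like
# "AU01_r", "gaze_0_x", and "pose_Tx" follow the OpenFace output format;
# check your OpenFace version's wiki for the authoritative list.
df = pd.read_csv("processed/video.csv")
df.columns = df.columns.str.strip()  # OpenFace pads column names with spaces

groups = {
    "action units (intensity)": [c for c in df.columns if c.startswith("AU") and c.endswith("_r")],
    "action units (presence)": [c for c in df.columns if c.startswith("AU") and c.endswith("_c")],
    "gaze": [c for c in df.columns if c.startswith("gaze")],
    "head pose": [c for c in df.columns if c.startswith("pose")],
}
for label, cols in groups.items():
    print(f"{label}: {len(cols)} columns, e.g. {cols[:3]}")
```

Mapping column groups like this is usually the easiest way to figure out which slice of a concatenated feature vector came from which toolkit.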