Closed srb-cv closed 2 years ago
Hey, thanks for your interest in the video data.
For the encodings, you can refer to OpenFace; there are also some tutorials and explanations available.
For experiments, the Facet features (size 35) are the most commonly used (refer to previous work); you can find all the MOSI features here.
I think all the features were simply extracted with toolkits such as Facet or OpenFace. I know these are out-of-date methods compared with today's fine-tuning approaches, so you are welcome to apply a state-of-the-art method to the raw video!
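For illustration only, here is a minimal sketch of working with the two visual feature sizes discussed in this thread (35-d MOSI Facet features vs. 713-d MOSEI features) by projecting both into a shared space before feeding them to a model. The dummy data, hidden size, and random projection are all assumptions for the example, not part of the released code or datasets.

```python
import numpy as np

# Dimensions taken from this thread; everything else is hypothetical.
MOSI_VISUAL_DIM = 35    # Facet features per frame (MOSI)
MOSEI_VISUAL_DIM = 713  # Facet + OpenFace features per frame (MOSEI)
HIDDEN_DIM = 64         # arbitrary shared size, chosen for illustration

rng = np.random.default_rng(0)

def project(features: np.ndarray, out_dim: int, rng) -> np.ndarray:
    """Map (seq_len, in_dim) visual features into a shared out_dim space
    with a random linear projection (a stand-in for a learned layer)."""
    in_dim = features.shape[1]
    w = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    return features @ w

# Dummy clips: 20 frames each, with the dataset-specific feature sizes.
mosi_feats = rng.standard_normal((20, MOSI_VISUAL_DIM))
mosei_feats = rng.standard_normal((20, MOSEI_VISUAL_DIM))

mosi_h = project(mosi_feats, HIDDEN_DIM, rng)
mosei_h = project(mosei_feats, HIDDEN_DIM, rng)
print(mosi_h.shape, mosei_h.shape)  # both (20, 64) after projection
```

In practice the random projection would be replaced by a trained linear layer (or by re-extracting features from the raw video with a modern backbone, as suggested above); the point is only that the 35-d and 713-d encodings need separate input projections before they can share downstream weights.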
Thank you for writing a brilliant paper and providing a convenient code repository for reproducing the results. I have gone through the repo and the paper, but I still have questions about the implemented datasets and dataloaders.
Could you please lend some time to elucidate the following questions about the datasets?
For the MOSEI dataset, the encodings for a data point are of size 713. I understand that these features are obtained from the OpenFace and Facet libraries, but could you tell us which components/indices in the encodings are obtained from which toolkit?
For the MOSI dataset, the encodings are only of size 35. It seems only the Facet features are provided for this dataset. Is there a reason why the other (OpenFace) features are not used/provided, as they are for MOSEI?
Are you fine-tuning on the training data of MOSI/MOSEI to obtain the video encodings?
Thank you again for your efforts. Your answers would save us many hours of banging our heads against the code.