bhigy opened this issue 3 years ago
For the visual part:
In both [1] and [2], a non-linear gating mechanism is applied to the vectors obtained from each modality. [3] instead relies on a training loss that compensates for misalignments between the narration and the video.
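If it helps, the gating in [1] and [2] is, as far as I can tell, the "gated embedding unit" of Miech et al. A minimal PyTorch sketch of the idea (the class name, dimensions and placement of the L2-normalisation are my own choices, not taken from the AVLnet code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedEmbeddingUnit(nn.Module):
    """Project a modality vector, then modulate it with a learned sigmoid gate."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)   # linear projection into the joint space
        self.cg = nn.Linear(out_dim, out_dim)  # context-gating weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc(x)
        x = x * torch.sigmoid(self.cg(x))      # non-linear gating of each dimension
        return F.normalize(x, dim=-1)          # unit norm, so dot products are cosine similarities
```

If I read the AVLnet code correctly, one such unit sits on top of the pooled features of each branch before the cross-modal similarity is computed.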
You can find the code from [1] here: https://github.com/roudimit/AVLnet.
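As for the loss in [3], it is (if I remember correctly) the MIL-NCE objective: each clip is paired with several neighbouring narrations, and the numerator sums over all of them rather than assuming a single aligned caption. A rough, simplified sketch of that idea (the tensor shapes, temperature value and single-direction formulation are assumptions on my part):

```python
import torch

def mil_nce_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """MIL-NCE-style contrastive loss (simplified sketch).

    video_emb: (B, D)    one L2-normalised embedding per clip
    text_emb:  (B, K, D) K candidate narrations per clip, possibly misaligned
    """
    B, K, D = text_emb.shape
    # Similarity of every clip against every narration in the batch: (B, B, K)
    sim = (video_emb @ text_emb.reshape(B * K, D).t() / temperature).view(B, B, K)
    # Numerator: sum over the K noisy positives of each clip
    pos = torch.logsumexp(sim[torch.arange(B), torch.arange(B)], dim=-1)
    # Denominator: sum over all narrations of all clips in the batch
    denom = torch.logsumexp(sim.reshape(B, -1), dim=-1)
    # -log( sum over positives / sum over all ), averaged over the batch
    return (denom - pos).mean()
```

Usage would simply be `mil_nce_loss(v, t)`, with `v` and `t` coming from the two gated branches above.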
Possible sources of inspiration for the visual part:
[1] https://arxiv.org/abs/2006.09199
[2] https://openaccess.thecvf.com/content_ICCV_2019/html/Miech_HowTo100M_Learning_a_Text-Video_Embedding_by_Watching_Hundred_Million_Narrated_ICCV_2019_paper.html
[3] https://openaccess.thecvf.com/content_CVPR_2020/html/Miech_End-to-End_Learning_of_Visual_Representations_From_Uncurated_Instructional_Videos_CVPR_2020_paper.html