spokenlanguage / platalea

Library for training visually-grounded models of spoken language understanding.
Apache License 2.0

[video-based training] Define and implement a basic architecture #77

Open bhigy opened 3 years ago

bhigy commented 3 years ago

Possible sources of inspiration for the visual part:
[1] https://arxiv.org/abs/2006.09199
[2] https://openaccess.thecvf.com/content_ICCV_2019/html/Miech_HowTo100M_Learning_a_Text-Video_Embedding_by_Watching_Hundred_Million_Narrated_ICCV_2019_paper.html
[3] https://openaccess.thecvf.com/content_CVPR_2020/html/Miech_End-to-End_Learning_of_Visual_Representations_From_Uncurated_Instructional_Videos_CVPR_2020_paper.html

bhigy commented 3 years ago

For the visual part:

Both [1] and [2] apply a non-linear gating mechanism to the embedding vectors obtained from each modality. [3] uses a dedicated training loss (MIL-NCE) that compensates for misalignment between the video and its narration.
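To make the two ideas concrete, here is a rough sketch in plain Python (standard library only, so it runs without PyTorch). It is not taken from the code of [1], [2] or [3]: the function names, argument names and toy values are all hypothetical. `gated_embedding` assumes the gating has the common form "project, multiply element-wise by a sigmoid gate, L2-normalise"; the exact formulation in [1] and [2] should be checked against the papers. `mil_nce_loss` assumes the loss for one clip is the negative log of the summed exponentiated scores of several candidate positives over the same sum plus the negatives, which is how I read [3].

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gated_embedding(x, w1, b1, w2, b2):
    """Project x, gate it element-wise with a sigmoid, then L2-normalise.

    Assumed form (hypothetical, not copied from [1] or [2]):
        z = W1 x + b1
        out = z * sigmoid(W2 z + b2), divided by its L2 norm
    """
    z = linear(w1, b1, x)
    gate = [sigmoid(g) for g in linear(w2, b2, z)]
    gated = [zi * gi for zi, gi in zip(z, gate)]
    norm = math.sqrt(sum(v * v for v in gated)) or 1.0  # avoid division by zero
    return [v / norm for v in gated]


def mil_nce_loss(positive_scores, negative_scores):
    """Loss for one clip with several candidate positives.

    Computes -log(sum_pos exp(s) / (sum_pos exp(s) + sum_neg exp(s))).
    Both arguments are lists of similarity scores.
    """
    m = max(positive_scores + negative_scores)  # subtract the max for stability
    pos = sum(math.exp(s - m) for s in positive_scores)
    neg = sum(math.exp(s - m) for s in negative_scores)
    return -math.log(pos / (pos + neg))


# Toy values, chosen only for illustration.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros_2x2 = [[0.0, 0.0], [0.0, 0.0]]
zero_bias = [0.0, 0.0]

embedding = gated_embedding([3.0, 4.0], identity, zero_bias, zeros_2x2, zero_bias)
loss_one_positive = mil_nce_loss([-1.0], [0.0])
loss_two_positives = mil_nce_loss([-1.0, 2.0], [0.0])
```

In this toy example, the second call to `mil_nce_loss` gives a lower loss than the first because one of the candidate positives scores well, even though the other does not. That is the sense in which this kind of loss tolerates misaligned pairs. In platalea itself, both pieces would presumably be written as PyTorch modules operating on batches rather than on Python lists.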

bhigy commented 3 years ago

You can find the code from [1] here: https://github.com/roudimit/AVLnet.