bhigy opened this issue 3 years ago
For the visual part:
In both [1] and [2], a non-linear gating mechanism is applied to the vectors obtained from each modality. [3] instead relies on a training loss that compensates for misalignments between the narration and the video.
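If it helps, the gating in [1] and [2] is, as far as I can tell, the "gated embedding unit" of Miech et al. A minimal PyTorch sketch of the idea (the class name, dimensions and placement of the L2-normalisation are my own choices, not taken from the AVLnet code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedEmbeddingUnit(nn.Module):
    """Project a modality vector, then modulate it with a learned sigmoid gate."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)   # linear projection into the joint space
        self.cg = nn.Linear(out_dim, out_dim)  # context-gating weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc(x)
        x = x * torch.sigmoid(self.cg(x))      # non-linear gating of each dimension
        return F.normalize(x, dim=-1)          # unit norm, so dot products are cosine similarities
```

If I read the AVLnet code correctly, one such unit sits on top of the pooled features of each branch before the cross-modal similarity is computed.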
You can find the code from [1] here: https://github.com/roudimit/AVLnet.
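As for the loss in [3], it is (if I remember correctly) the MIL-NCE objective: each clip is paired with several neighbouring narrations, and the numerator sums over all of them rather than assuming a single aligned caption. A rough, simplified sketch of that idea (the tensor shapes, temperature value and single-direction formulation are assumptions on my part):

```python
import torch

def mil_nce_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """MIL-NCE-style contrastive loss (simplified sketch).

    video_emb: (B, D)    one L2-normalised embedding per clip
    text_emb:  (B, K, D) K candidate narrations per clip, possibly misaligned
    """
    B, K, D = text_emb.shape
    # Similarity of every clip against every narration in the batch: (B, B, K)
    sim = (video_emb @ text_emb.reshape(B * K, D).t() / temperature).view(B, B, K)
    # Numerator: sum over the K noisy positives of each clip
    pos = torch.logsumexp(sim[torch.arange(B), torch.arange(B)], dim=-1)
    # Denominator: sum over all narrations of all clips in the batch
    denom = torch.logsumexp(sim.reshape(B, -1), dim=-1)
    # -log( sum over positives / sum over all ), averaged over the batch
    return (denom - pos).mean()
```

Usage would simply be `mil_nce_loss(v, t)`, with `v` and `t` coming from the two gated branches above.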
Possible sources of inspiration for the visual part:
[1] https://arxiv.org/abs/2006.09199
[2] https://openaccess.thecvf.com/content_ICCV_2019/html/Miech_HowTo100M_Learning_a_Text-Video_Embedding_by_Watching_Hundred_Million_Narrated_ICCV_2019_paper.html
[3] https://openaccess.thecvf.com/content_CVPR_2020/html/Miech_End-to-End_Learning_of_Visual_Representations_From_Uncurated_Instructional_Videos_CVPR_2020_paper.html