rationalism opened this issue 8 months ago
I totally support this proposal. The code and checkpoints are available at https://github.com/YuanGongND/cav-mae.
Created an initial PR here by cloning the visual-audio multimodal model TVLT: https://github.com/huggingface/transformers/pull/28246
(the CAV-MAE code itself hasn't been added yet; this is just a scaffold)
Model description
Contrastive Audio-Visual Masked Autoencoder (CAV-MAE) combines two major self-supervised learning frameworks, contrastive learning and masked data modeling, to learn a joint and coordinated audio-visual representation. It appears to be the open-source state of the art on the AudioSet and VGGSound datasets (weights for the OmniVec and Facebook MAViL models seem never to have been released).
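To make the "contrastive + masked modeling" combination concrete, here is a minimal NumPy sketch of how the two objectives can be joined into one training loss: a symmetric InfoNCE-style contrastive term over paired audio/visual clip embeddings, plus a mean-squared reconstruction term computed only on masked patches. All names, the loss weight `lam`, and the temperature `tau` are illustrative placeholders, not the exact values or implementation from the CAV-MAE paper or repo.

```python
import numpy as np

def info_nce(logits):
    """InfoNCE over a (B, B) similarity matrix where matched
    audio/visual pairs sit on the diagonal (illustrative helper)."""
    # numerically stable softmax per row
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    p_matched = np.diag(exp) / exp.sum(axis=1)
    return -np.mean(np.log(p_matched))

def cav_mae_loss_sketch(audio_emb, video_emb,
                        pred_patches, true_patches, mask,
                        lam=0.01, tau=0.05):
    """Illustrative combined loss: contrastive + masked reconstruction.

    audio_emb, video_emb: (B, D) per-clip embeddings, paired by index
    pred_patches, true_patches: (B, N, P) patch predictions / targets
    mask: (B, N) boolean, True where a patch was masked out
    """
    # L2-normalize embeddings for the contrastive term
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / tau  # (B, B) cross-modal similarities
    # symmetric contrastive loss (audio->video and video->audio)
    contrastive = 0.5 * (info_nce(logits) + info_nce(logits.T))
    # reconstruction error only over masked patches, as in MAE-style training
    recon = np.mean((pred_patches[mask] - true_patches[mask]) ** 2)
    return contrastive + lam * recon
```

The design point this sketch illustrates is that the two objectives act on different outputs: the contrastive term shapes the pooled per-clip embeddings so matched audio and video land near each other, while the reconstruction term is applied patch-wise and only where the input was masked.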
Open source status
Provide useful links for the implementation
https://github.com/YuanGongND/cav-mae
@YuanGongND