rationalism opened this issue 8 months ago
I totally support this proposal. The code and checkpoints are available at https://github.com/YuanGongND/cav-mae.
Created an initial PR here by cloning the visual-audio multimodal model TVLT: https://github.com/huggingface/transformers/pull/28246
(the CAV-MAE code itself hasn't been added yet; this is just a scaffold)
Model description
Contrastive Audio-Visual Masked Autoencoder (CAV-MAE) combines two major self-supervised learning frameworks, contrastive learning and masked data modeling, to learn a joint and coordinated audio-visual representation. It appears to be the open-source state of the art on the AudioSet and VGGSound datasets (weights for the OmniVec and Facebook MAViL models seem never to have been released).
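To make the "contrastive + masked modeling" combination concrete, here is a minimal NumPy sketch of how the two objectives can be joined into one training loss: a symmetric InfoNCE-style contrastive term over paired audio/visual clip embeddings, plus a mean-squared reconstruction term computed only on masked patches. All names, the loss weight `lam`, and the temperature `tau` are illustrative placeholders, not the exact values or implementation from the CAV-MAE paper or repo.

```python
import numpy as np

def info_nce(logits):
    """InfoNCE over a (B, B) similarity matrix where matched
    audio/visual pairs sit on the diagonal (illustrative helper)."""
    # numerically stable softmax per row
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    p_matched = np.diag(exp) / exp.sum(axis=1)
    return -np.mean(np.log(p_matched))

def cav_mae_loss_sketch(audio_emb, video_emb,
                        pred_patches, true_patches, mask,
                        lam=0.01, tau=0.05):
    """Illustrative combined loss: contrastive + masked reconstruction.

    audio_emb, video_emb: (B, D) per-clip embeddings, paired by index
    pred_patches, true_patches: (B, N, P) patch predictions / targets
    mask: (B, N) boolean, True where a patch was masked out
    """
    # L2-normalize embeddings for the contrastive term
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / tau  # (B, B) cross-modal similarities
    # symmetric contrastive loss (audio->video and video->audio)
    contrastive = 0.5 * (info_nce(logits) + info_nce(logits.T))
    # reconstruction error only over masked patches, as in MAE-style training
    recon = np.mean((pred_patches[mask] - true_patches[mask]) ** 2)
    return contrastive + lam * recon
```

The design point this sketch illustrates is that the two objectives act on different outputs: the contrastive term shapes the pooled per-clip embeddings so matched audio and video land near each other, while the reconstruction term is applied patch-wise and only where the input was masked.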
Open source status
Provide useful links for the implementation
https://github.com/YuanGongND/cav-mae
@YuanGongND