huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add CAV-MAE audio-image encoder model #28236

Open rationalism opened 8 months ago

rationalism commented 8 months ago

Model description

Contrastive Audio-Visual Masked Autoencoder (CAV-MAE) combines two major self-supervised learning frameworks, contrastive learning and masked data modeling, to learn a joint and coordinated audio-visual representation. It appears to be the open-source state of the art on the AudioSet and VGGSound datasets (the OmniVec and Facebook MAViL models seem never to have had weights released).
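To make the "contrastive + masked data modeling" combination concrete, here is a minimal NumPy sketch of what such a joint objective can look like: a symmetric InfoNCE-style contrastive loss over pooled audio/visual embeddings plus an MSE reconstruction loss on masked patches. The function name, shapes, temperature, and loss weighting are illustrative assumptions for this sketch, not CAV-MAE's exact hyperparameters.

```python
import numpy as np

def cav_mae_style_loss(audio_emb, visual_emb, recon, target, mask,
                       temperature=0.07, contrast_weight=0.01):
    """Illustrative joint objective (assumed form, not the paper's exact one):
    symmetric contrastive loss over pooled audio/visual embeddings plus a
    reconstruction loss computed only on masked positions."""
    # L2-normalize embeddings so dot products become cosine similarities
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(a))                 # matched pairs lie on the diagonal

    def ce(lg):
        # cross-entropy over rows, with the diagonal as the positive class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    contrastive = 0.5 * (ce(logits) + ce(logits.T))     # audio->visual and back
    # masked-reconstruction term: MSE averaged over masked patches (mask == 1)
    reconstruction = ((recon - target) ** 2 * mask).sum() / mask.sum()
    return contrast_weight * contrastive + reconstruction
```

The two terms pull in complementary directions: the contrastive term aligns the audio and visual embeddings of the same clip across the batch, while the reconstruction term forces the encoder to retain enough detail to fill in masked inputs.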

Open source status

Provide useful links for the implementation

https://github.com/YuanGongND/cav-mae

@YuanGongND

YuanGongND commented 8 months ago

I totally support this proposal. The code and checkpoints are available at https://github.com/YuanGongND/cav-mae.

rationalism commented 8 months ago

Created an initial PR by cloning the audio-visual multimodal model TVLT: https://github.com/huggingface/transformers/pull/28246

(I haven't actually added the CAV-MAE code yet; this is just a scaffold.)