Audio-MAE - ViTMAE for audio

justinluong commented 1 year ago

Model description

This model is is a Self-supervised Vision Transformer that uses patch reconstruction as the spectrogram task. It extends MAE (which is already on HuggingFace) for audio. This model would be a valuable addition as there doesn't seem to be a self-supervised ViT model on HugginFace currently. AST is the closest and uses supervised pre-training. Conceptually, Audio-MAE is also simpler but achieves comparable performance in their paper.

Some differences compared to the standard MAE Model

During pre-training local and hybrid attention mechanisms can be used.
During fine-tuning masking is also used which differs to MAE.

Open source status

[X] The model implementation is available
[X] The model weights are available

Provide useful links for the implementation

Implementation https://github.com/facebookresearch/AudioMAE Created by Po-Yao Huang @berniebear on Github

Pre-trained Weights Available in the github repo

Paper (Masked Autoencoders that Listen) https://arxiv.org/abs/2207.06405

amyeroberts commented 1 year ago

cc @sanchit-gandhi

Pratyush-exe commented 1 year ago

cc @sanchit-gandhi

Can I pick this up? Would be a valuable learning task for me :)

Thanks

justinluong commented 1 year ago

I should have clarified in my initial post that my intention was to contribute this model personally, as I've been working with the model a lot recently. However, I'm definitely open to collaborate! Maybe we could work together on this @Pratyush-exe :)

Pratyush-exe commented 1 year ago

This model looks very interesting! I would love to collaborate, if that's ohkay with you @justinluong :)

justinluong commented 1 year ago

Hey @Pratyush-exe sorry for the late reply! Things have been quite busy at work recently. If you'd like, please feel free to pick this up instead as I think I won't have bandwidth to work on it for a while. All the best :)

Pratyush-exe commented 1 year ago

Sure @justinluong.

Would love to pick this up.

Pratyush-exe commented 1 year ago

Hi @amyeroberts @sanchit-gandhi Please assign this to me.

Thanks

ArthurZucker commented 1 year ago

Hey, we usually don't assign, just open a PR and link this issue 🤗

Pratyush-exe commented 11 months ago

I have been having problems in reconstructing the kaldi.fbank to audio file. The audio is very noisy. I am using librosa.feature.inverse.mel_to_audio for the conversion. I know fbank and mel_spectogram are not the same thing but that the only thing I found through search. Also the results shown in the original repo is good. Any idea how that is done?

ps4vs commented 11 months ago

Hi all 🤗, @ArthurZucker @justinluong

I have started working on adding AudioMAE as my first contribution to hugging-face

I made some notes on AudioMAE Notes, and want to add that decoder alone uses local and hybrid attention, which is thrown during finetuning.

sanchit-gandhi commented 10 months ago

Awesome to see so much interest in this model! Given AST is super popular on the Hub as the de facto audio classification model, the model has a permissive license, and the original implementation is somewhat difficult to run, I think this would be a valuable new model addition. Feel free to open a PR to start the contribution! You can start by copying the most related model (either MAE or AST) and then gradually update the code to bring it into alignment with Audio-MAE. Here's the full guide for contributing a model, which explains this process https://huggingface.co/docs/transformers/add_new_model

cc @ylacombe as well

pennychong94 commented 9 months ago

@ps4vs @Pratyush-exe What is the status on this?

ArthurZucker commented 9 months ago

let's open a PR and link it @ps4vs 🤗

ps4vs commented 9 months ago

Roger that, I will open the PR by EOD.

pennychong94 commented 9 months ago

@ps4vs when do you expect the model code to be put in the PR

huggingface / transformers