huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Audio-MAE - ViTMAE for audio #27453

Open justinluong opened 10 months ago

justinluong commented 10 months ago

Model description

This model is a self-supervised Vision Transformer that is pre-trained by reconstructing masked spectrogram patches. It extends MAE (which is already on Hugging Face) to audio. It would be a valuable addition, as there doesn't seem to be a self-supervised audio ViT model on Hugging Face currently. AST is the closest, but it uses supervised pre-training. Conceptually, Audio-MAE is also simpler, yet it achieves comparable performance in the paper.
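To illustrate the pre-training objective, here is a minimal sketch (my own illustration, not code from the original repo) of masking and reconstructing spectrogram patches, assuming a 1024-frame, 128-mel-bin spectrogram, 16x16 patches, and the 0.8 masking ratio reported in the paper:

```python
import torch

spec = torch.randn(1, 1024, 128)  # (batch, time frames, mel bins)
# Split the spectrogram into non-overlapping 16x16 patches -> (1, 512, 256)
patches = spec.unfold(1, 16, 16).unfold(2, 16, 16).reshape(1, -1, 16 * 16)

mask_ratio = 0.8
num_patches = patches.shape[1]
num_keep = int(num_patches * (1 - mask_ratio))
perm = torch.randperm(num_patches)
visible_idx, masked_idx = perm[:num_keep], perm[num_keep:]

visible = patches[:, visible_idx]           # only these are fed to the encoder
target = patches[:, masked_idx]             # the decoder must reconstruct these
reconstruction = torch.zeros_like(target)   # stand-in for the decoder output
loss = torch.nn.functional.mse_loss(reconstruction, target)
```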

Some differences compared to the standard MAE Model

Open source status

Provide useful links for the implementation

Implementation: https://github.com/facebookresearch/AudioMAE (created by Po-Yao Huang, @berniebear on GitHub)

Pre-trained weights: available in the GitHub repo

Paper: Masked Autoencoders that Listen, https://arxiv.org/abs/2207.06405

amyeroberts commented 9 months ago

cc @sanchit-gandhi

Pratyush-exe commented 9 months ago

cc @sanchit-gandhi

Can I pick this up? Would be a valuable learning task for me :)

Thanks

justinluong commented 9 months ago

I should have clarified in my initial post that my intention was to contribute this model personally, as I've been working with it a lot recently. However, I'm definitely open to collaborating! Maybe we could work on this together, @Pratyush-exe :)

Pratyush-exe commented 9 months ago

This model looks very interesting! I would love to collaborate, if that's okay with you @justinluong :)

justinluong commented 9 months ago

Hey @Pratyush-exe sorry for the late reply! Things have been quite busy at work recently. If you'd like, please feel free to pick this up instead as I think I won't have bandwidth to work on it for a while. All the best :)

Pratyush-exe commented 9 months ago

Sure @justinluong.

Would love to pick this up.

Pratyush-exe commented 9 months ago

Hi @amyeroberts @sanchit-gandhi Please assign this to me.

Thanks

ArthurZucker commented 9 months ago

Hey, we usually don't assign, just open a PR and link this issue 🤗

Pratyush-exe commented 8 months ago

I have been having problems reconstructing the Kaldi fbank features back into an audio file; the resulting audio is very noisy. I am using librosa.feature.inverse.mel_to_audio for the conversion. I know an fbank and a mel spectrogram are not the same thing, but that's the only approach I found through searching. The reconstructions shown in the original repo sound good, though. Any idea how that is done?
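For reference, this is roughly what I am trying (a sketch of my setup, assuming 16 kHz mono audio and the default 25 ms / 10 ms framing; `example.wav` is a placeholder). Since kaldi.fbank returns natural-log mel energies, I exponentiate before the Griffin-Lim based inverse. The Kaldi and librosa mel filterbanks, windows, pre-emphasis, and dithering all differ, so I expect the inversion to be only approximate:

```python
import librosa
import torch
import torchaudio

waveform, sr = torchaudio.load("example.wav")  # assumed 16 kHz mono

# Kaldi-style log-mel filterbank features, shape (num_frames, 128)
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    num_mel_bins=128,
    sample_frequency=sr,
    frame_length=25.0,  # ms
    frame_shift=10.0,   # ms
    dither=0.0,
    htk_compat=True,
)

# Undo the natural log to get (approximately) a power mel spectrogram, (128, num_frames)
mel_power = torch.exp(fbank).T.numpy()

# Approximate inversion via Griffin-Lim; parameters chosen to roughly match the framing above
audio = librosa.feature.inverse.mel_to_audio(
    mel_power,
    sr=sr,
    n_fft=400,        # 25 ms at 16 kHz
    hop_length=160,   # 10 ms at 16 kHz
    htk=True,         # HTK-style mel scale, closer to Kaldi's
    power=2.0,
)
```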

ps4vs commented 8 months ago

Hi all 🤗, @ArthurZucker @justinluong

I have started working on adding AudioMAE as my first contribution to Hugging Face.

I made some notes on AudioMAE (AudioMAE Notes), and want to add that only the decoder uses local and hybrid attention, and the decoder is discarded during fine-tuning.
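For clarity, here's a rough sketch (my own illustration, not the original code) of a window-based local attention mask over the (time, frequency) patch grid, of the kind the decoder restricts attention with:

```python
import torch

def local_attention_mask(grid_h, grid_w, win_h, win_w):
    """Boolean mask of shape (N, N); True where attention between two patches is allowed."""
    ids = torch.arange(grid_h * grid_w)
    rows, cols = ids // grid_w, ids % grid_w
    # Assign each patch to a non-overlapping window; attention stays within a window
    win_id = (rows // win_h) * (grid_w // win_w) + (cols // win_w)
    return win_id.unsqueeze(0) == win_id.unsqueeze(1)

# e.g. a 64 x 8 patch grid (1024-frame, 128-mel spectrogram with 16x16 patches), 4x4 windows
mask = local_attention_mask(64, 8, 4, 4)
print(mask.shape)  # torch.Size([512, 512])
```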

sanchit-gandhi commented 7 months ago

Awesome to see so much interest in this model! Given that AST is super popular on the Hub as the de facto audio classification model, the model has a permissive license, and the original implementation is somewhat difficult to run, I think this would be a valuable new model addition. Feel free to open a PR to start the contribution! You can start by copying the most related model (either MAE or AST) and then gradually update the code to bring it into alignment with Audio-MAE. Here's the full guide for contributing a model, which explains this process: https://huggingface.co/docs/transformers/add_new_model
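For example, a minimal sketch of inspecting the two most related models already in the library as starting references (the eventual Audio-MAE class names, e.g. `AudioMAEConfig`, are hypothetical and don't exist in `transformers` yet):

```python
from transformers import ViTMAEConfig, ViTMAEForPreTraining, ASTConfig

# ViTMAE already implements the masked-autoencoder encoder/decoder that Audio-MAE extends
mae_config = ViTMAEConfig()
mae = ViTMAEForPreTraining(mae_config)       # randomly initialised, for structure inspection
print(mae_config.mask_ratio)                 # random patch masking ratio used in pre-training

# AST shows how spectrogram inputs are handled for audio (128 mel bins x 1024 frames)
ast_config = ASTConfig()
print(ast_config.num_mel_bins, ast_config.max_length)
```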

cc @ylacombe as well

pennychong94 commented 6 months ago

@ps4vs @Pratyush-exe What is the status on this?

ArthurZucker commented 6 months ago

Let's open a PR and link it here, @ps4vs 🤗

ps4vs commented 6 months ago

Roger that, I will open the PR by EOD.

pennychong94 commented 6 months ago

@ps4vs when do you expect the model code to be added to the PR?