justinluong opened this issue 1 year ago
cc @sanchit-gandhi
Can I pick this up? Would be a valuable learning task for me :)
Thanks
I should have clarified in my initial post that my intention was to contribute this model personally, as I've been working with it a lot recently. However, I'm definitely open to collaborating! Maybe we could work together on this, @Pratyush-exe :)
This model looks very interesting! I would love to collaborate, if that's okay with you @justinluong :)
Hey @Pratyush-exe sorry for the late reply! Things have been quite busy at work recently. If you'd like, please feel free to pick this up instead as I think I won't have bandwidth to work on it for a while. All the best :)
Sure @justinluong.
Would love to pick this up.
Hi @amyeroberts @sanchit-gandhi Please assign this to me.
Thanks
Hey, we usually don't assign, just open a PR and link this issue 🤗
I have been having problems reconstructing audio from the kaldi.fbank features; the resulting audio is very noisy. I am using librosa.feature.inverse.mel_to_audio for the conversion. I know an fbank and a mel spectrogram are not the same thing, but that's the only approach I found through searching. Also, the results shown in the original repo sound good. Any idea how that is done?
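For reference, here's a minimal sketch of one way this inversion can work, assuming the features come from torchaudio.compliance.kaldi.fbank: that function returns natural-log mel energies, so the log has to be undone before calling mel_to_audio, and the mel/framing parameters (sample rate, number of bins, window and hop sizes, HTK-style mel scale) have to match the ones used to compute the fbank. All parameter values below are assumptions:

```python
# Minimal sketch: invert a Kaldi-style log-mel filterbank back to audio via
# Griffin-Lim. The parameter values are assumptions and must match whatever
# produced the fbank, or the output will be distorted.
import numpy as np
import torchaudio
import librosa

waveform, sr = torchaudio.load("example.wav")  # hypothetical input file

# Kaldi-style fbank: natural-log mel energies, shape (frames, num_mel_bins)
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    num_mel_bins=128,
    sample_frequency=sr,
    frame_length=25.0,  # ms
    frame_shift=10.0,   # ms
)

# Undo the log and transpose to librosa's expected (n_mels, frames) layout
mel = np.exp(fbank.numpy()).T

# Griffin-Lim reconstruction; htk=True approximates Kaldi's mel scale
audio = librosa.feature.inverse.mel_to_audio(
    mel,
    sr=sr,
    n_fft=int(sr * 0.025),
    hop_length=int(sr * 0.010),
    htk=True,
)
```

Even with matching parameters, Griffin-Lim only estimates phase, so some distortion is expected.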
Hi all 🤗, @ArthurZucker @justinluong
I have started working on adding AudioMAE as my first contribution to Hugging Face.
I made some notes on AudioMAE Notes, and want to add that only the decoder uses local and hybrid attention, which is discarded during fine-tuning.
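For readers unfamiliar with the term, here is a minimal sketch of local window attention over a 2-D grid of spectrogram patch embeddings; this is a generic illustration rather than the paper's exact hybrid scheme, and the module and parameter names are hypothetical:

```python
# Generic local window attention: attention is computed only within
# non-overlapping (window x window) regions of the patch grid.
import torch
import torch.nn as nn

class LocalWindowAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, window: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, dim) grid of patch embeddings
        b, h, w, d = x.shape
        ws = self.window
        assert h % ws == 0 and w % ws == 0, "grid must divide into windows"
        # Partition the grid into windows and flatten each window's tokens
        x = x.view(b, h // ws, ws, w // ws, ws, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, d)
        # Self-attention restricted to tokens inside the same window
        x, _ = self.attn(x, x, x)
        # Undo the window partition back to the (b, h, w, d) grid
        x = x.view(b, h // ws, w // ws, ws, ws, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, d)
        return x
```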
Awesome to see so much interest in this model! Given AST is super popular on the Hub as the de facto audio classification model, the model has a permissive license, and the original implementation is somewhat difficult to run, I think this would be a valuable new model addition. Feel free to open a PR to start the contribution! You can start by copying the most related model (either MAE or AST) and then gradually update the code to bring it into alignment with Audio-MAE. Here's the full guide for contributing a model, which explains this process: https://huggingface.co/docs/transformers/add_new_model
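A minimal sketch of that first step, assuming one starts by loading the two closest existing models to compare their configurations (both checkpoint names are public ones on the Hub):

```python
# Compare the configs of the two closest existing models before adapting
# either implementation to Audio-MAE.
from transformers import ViTMAEModel, ASTModel

mae = ViTMAEModel.from_pretrained("facebook/vit-mae-base")
ast = ASTModel.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

print(mae.config)  # image patches, random masking ratio, decoder dimensions
print(ast.config)  # spectrogram patches over (time, frequency)
```

The add_new_model guide also covers `transformers-cli add-new-model-like`, which scaffolds a new model folder from an existing one.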
cc @ylacombe as well
@ps4vs @Pratyush-exe What is the status on this?
let's open a PR and link it @ps4vs 🤗
Roger that, I will open the PR by EOD.
@ps4vs when do you expect the model code to be pushed to the PR?
Model description
This model is a self-supervised Vision Transformer that uses spectrogram patch reconstruction as its pre-training task. It extends MAE (which is already on Hugging Face) to audio. This model would be a valuable addition as there doesn't currently seem to be a self-supervised audio ViT on Hugging Face. AST is the closest, but it uses supervised pre-training. Conceptually, Audio-MAE is also simpler, yet achieves comparable performance in the paper.
Some differences compared to the standard MAE model: for example, the decoder uses local and hybrid window attention, which is discarded after pre-training.
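For concreteness, a minimal sketch of the MAE-style objective described above, assuming PyTorch: randomly mask most spectrogram patches, encode only the visible ones, and reconstruct the masked ones with an MSE loss. The encoder/decoder here are hypothetical stand-ins, and the 0.8 masking ratio follows the paper:

```python
# MAE-style masked reconstruction over flattened spectrogram patches.
# `encoder` and `decoder` are hypothetical callables standing in for the model.
import torch

def mae_loss(patches, encoder, decoder, mask_ratio=0.8):
    # patches: (batch, num_patches, patch_dim) flattened spectrogram patches
    b, n, d = patches.shape
    num_keep = int(n * (1 - mask_ratio))

    # Per-example random permutation; the first num_keep patches stay visible
    noise = torch.rand(b, n, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    # Encode visible patches only, then decode predictions for all patches
    latent = encoder(visible)
    pred = decoder(latent, ids_shuffle)  # (batch, num_patches, patch_dim)

    # MSE computed on masked positions only
    mask = torch.ones(b, n, device=patches.device)
    mask.scatter_(1, ids_keep, 0.0)
    loss = ((pred - patches) ** 2).mean(dim=-1)
    return (loss * mask).sum() / mask.sum()
```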
Open source status
Provide useful links for the implementation
- Implementation: https://github.com/facebookresearch/AudioMAE (created by Po-Yao Huang, @berniebear on GitHub)
- Pre-trained weights: available in the GitHub repo
- Paper: Masked Autoencoders that Listen, https://arxiv.org/abs/2207.06405