patrickvonplaten closed this issue 1 year ago
Hi Patrick,
Thanks so much for bringing this up. I'd love to add AST to Hugging Face.
I am checking https://github.com/huggingface/transformers/tree/main/templates/adding_a_new_model for how to do that. Is that the right tutorial to read?
I also have a quick question: the AST model is built on the timm package. Is such a dependency OK?
Thanks!
-Yuan
Oh wow, super cool to get such a quick answer from you :-) It would be amazing if you could give it a try with the https://github.com/huggingface/transformers/tree/main/templates/adding_a_new_model docs!
If possible, it'd be great to avoid a timm dependency and instead more or less copy-paste the relevant code from timm into transformers. @NielsRogge mentioned that the model is more or less a ViT, so maybe we can get some inspiration from the existing ViT model: https://github.com/huggingface/transformers/blob/main/src/transformers/models/vit/modeling_vit.py. It's also totally fine to add the timm dependency in a first step, and we'll help you adjust the PR afterwards :-)
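To make the "AST is more or less a ViT" point concrete, here is a minimal, hypothetical sketch (names and shapes are assumptions, not the actual PR code): the main architectural change is the patch embedding, which in ViT consumes a 3-channel image and in AST consumes a single-channel (frequency x time) spectrogram. Everything after tokenization can reuse the standard Transformer encoder.

```python
import torch
import torch.nn as nn

class SpectrogramPatchEmbed(nn.Module):
    """ViT-style patch embedding adapted for spectrograms (illustrative sketch).

    A ViT embeds 3-channel 224x224 images as 16x16 patches; for AST the input
    is a 1-channel spectrogram, so in_chans becomes 1 and the patch grid is
    sized to the spectrogram. (The real AST also uses overlapping patches and
    ImageNet-pretrained weights; this sketch uses non-overlapping patches for
    simplicity.)
    """

    def __init__(self, freq_bins=128, time_frames=1024, patch_size=16, embed_dim=768):
        super().__init__()
        self.num_patches = (freq_bins // patch_size) * (time_frames // patch_size)
        # A strided conv both cuts the input into patches and projects them.
        self.proj = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, spec):
        # spec: (batch, 1, freq_bins, time_frames)
        x = self.proj(spec)                   # (batch, embed_dim, F/16, T/16)
        return x.flatten(2).transpose(1, 2)   # (batch, num_patches, embed_dim)

embed = SpectrogramPatchEmbed()
tokens = embed(torch.randn(2, 1, 128, 1024))
print(tokens.shape)  # torch.Size([2, 512, 768]) -> 8 * 64 = 512 patches
```

The resulting token sequence can then be fed to any ViT encoder stack (e.g. the layers in modeling_vit.py) with a learned positional embedding interpolated to the new patch grid.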
cc @sgugger @LysandreJik @anton-l for visibility
Sure. Thanks for the suggestion. I will give it a try and let you know if I have any questions.
@YuanGongND if I may pitch in -- without timm, TF folks like me can easily port your model to TF 🧡
@YuanGongND Did it work?
Hi, I just need to find some time to do that. It will be soon!
@YuanGongND did you manage to get this going, or do you need some help? I'd be interested in contributing if possible!
Hi,
I actually have a working implementation that I need to finish. Will open a PR soon, hopefully!
Wow @NielsRogge, that's great!
Gotcha Niels, thanks!
🌟 New model addition
Model description
In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.

Index Terms: audio classification, self-attention
Open source status
Happy to supervise anyone interested in porting the model :-)