patrickvonplaten closed this issue 1 year ago
Hi Patrick,
Thanks so much for bringing this up. I'd love to add AST to Hugging Face.
I am checking https://github.com/huggingface/transformers/tree/main/templates/adding_a_new_model for how to do that. Is that the right tutorial to read?
I also have a quick question: the AST model is built on the timm package. Is such a dependency OK?
Thanks!
-Yuan
Oh wow, super cool to get such a quick answer from you :-) It would be amazing if you could give it a try with the https://github.com/huggingface/transformers/tree/main/templates/adding_a_new_model docs!
If possible, it'd be great to avoid a timm dependency and instead more or less copy-paste the relevant code from timm into transformers. @NielsRogge mentioned that the model is more or less a ViT, so maybe we can get some inspiration from the existing ViT model: https://github.com/huggingface/transformers/blob/main/src/transformers/models/vit/modeling_vit.py. It's also totally fine to add the timm dependency in a first step, and we'll help you adjust the PR afterwards :-)
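To make the "AST is more or less a ViT" point concrete, here is a minimal, hypothetical sketch (names and shapes are assumptions, not the actual PR code): the main architectural change is the patch embedding, which in ViT consumes a 3-channel image and in AST consumes a single-channel (frequency x time) spectrogram. Everything after tokenization can reuse the standard Transformer encoder.

```python
import torch
import torch.nn as nn

class SpectrogramPatchEmbed(nn.Module):
    """ViT-style patch embedding adapted for spectrograms (illustrative sketch).

    A ViT embeds 3-channel 224x224 images as 16x16 patches; for AST the input
    is a 1-channel spectrogram, so in_chans becomes 1 and the patch grid is
    sized to the spectrogram. (The real AST also uses overlapping patches and
    ImageNet-pretrained weights; this sketch uses non-overlapping patches for
    simplicity.)
    """

    def __init__(self, freq_bins=128, time_frames=1024, patch_size=16, embed_dim=768):
        super().__init__()
        self.num_patches = (freq_bins // patch_size) * (time_frames // patch_size)
        # A strided conv both cuts the input into patches and projects them.
        self.proj = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, spec):
        # spec: (batch, 1, freq_bins, time_frames)
        x = self.proj(spec)                   # (batch, embed_dim, F/16, T/16)
        return x.flatten(2).transpose(1, 2)   # (batch, num_patches, embed_dim)

embed = SpectrogramPatchEmbed()
tokens = embed(torch.randn(2, 1, 128, 1024))
print(tokens.shape)  # torch.Size([2, 512, 768]) -> 8 * 64 = 512 patches
```

The resulting token sequence can then be fed to any ViT encoder stack (e.g. the layers in modeling_vit.py) with a learned positional embedding interpolated to the new patch grid.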
cc @sgugger @LysandreJik @anton-l for visibility
Sure. Thanks for the suggestion. I will give it a try and let you know if I have any questions.
@YuanGongND if I may pitch in -- without timm, TF folks like me can easily port your model to TF 🧡
@YuanGongND Did it work?
Hi, I just need to find some time to do that. It will be soon!
@YuanGongND did you manage to get this going, or do you need some help? I'd be interested in contributing if possible!
Hi,
I actually have a working implementation that I need to finish. Will open a PR soon, hopefully!
Wow @NielsRogge, that's great!
Gotcha Niels, thanks!
🌟 New model addition
Model description
In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.

Index Terms: audio classification, self-attention
Open source status
Happy to supervise anyone interested in porting the model :-)