kkoutini / PaSST

Efficient Training of Audio Transformers with Patchout
Apache License 2.0
287 stars 48 forks

From ViT models to audio #45

Open Antoine101 opened 4 months ago

Antoine101 commented 4 months ago

Hi Khaled,

In your code, there is the possibility to create a ViT architecture and load the corresponding pretrained weights (like "vit_tiny_patch16_224").

Do we agree that such architectures only work with inputs of a similar size (224x224, for example)? If so, how did you fine-tune a model on AudioSet that was initially trained on ImageNet (going from 224x224 to 128x998, for example)? Is this procedure implemented somewhere in your repo?

I read the AST paper, which I guess you took inspiration from, and it discusses this in some detail. I was just wondering how I would carry out the whole process (ImageNet -> AudioSet -> ESC50) on my end.

Thanks a lot.

Antoine

kkoutini commented 3 months ago

Hi Antoine,

Yes, the code should support more architectures.

If the number of input channels is different, the input channels are averaged (here and here).
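A minimal sketch of the channel-collapsing idea described above, assuming a ViT-Base patch embedding (shapes and variable names are illustrative, not PaSST's exact code): a conv weight pretrained on 3-channel RGB is reduced to a single input channel for spectrograms by averaging over the input-channel dimension.

```python
import torch

# Pretrained patch-embedding conv weight: (out_ch, in_ch, kH, kW),
# e.g. ViT-Base with 16x16 patches trained on RGB images.
rgb_weight = torch.randn(768, 3, 16, 16)

# Collapse the 3 RGB input channels to 1 (mono spectrogram) by averaging.
mono_weight = rgb_weight.mean(dim=1, keepdim=True)

print(mono_weight.shape)  # torch.Size([768, 1, 16, 16])
```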

If the input size is different (for example, 224x224 to 128x998), the only thing that changes is the positional embeddings; this is done here. In short, the positional embeddings are interpolated to match the new size (similar to AST). After that, they are averaged over time/freq to produce separate freq/time positional embeddings.
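A hedged sketch of that resizing step, under assumed shapes (a 14x14 token grid for 224x224 input with 16x16 patches, resized to roughly 8x62 for a 128x998 spectrogram; this is not PaSST's exact code): interpolate the 2D positional-embedding grid, then average over the time and frequency axes to get the separate embeddings.

```python
import torch
import torch.nn.functional as F

embed_dim = 768
old_grid = (14, 14)   # 224/16 x 224/16 token grid from ImageNet pretraining
new_grid = (8, 62)    # ~128/16 freq bins x ~998/16 time frames

# Pretrained positional embeddings laid out as a 2D grid: (1, D, H, W).
pos = torch.randn(1, embed_dim, *old_grid)

# Bilinearly interpolate the grid to the new spectrogram token layout.
pos = F.interpolate(pos, size=new_grid, mode="bilinear", align_corners=False)

# Average over time to get frequency embeddings, and over frequency
# to get time embeddings.
freq_pos = pos.mean(dim=3, keepdim=True)  # (1, D, 8, 1)
time_pos = pos.mean(dim=2, keepdim=True)  # (1, D, 1, 62)
```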

Antoine101 commented 3 months ago

Great, I'll have a look at all that! Thanks a lot.

Antoine101 commented 3 months ago

Regarding that question, I see in the code that there are different lists of architectures between the get_model function, the default_cfg dictionary, and the architecture functions.

default_cfg seems to be the most exhaustive, but not every architecture in this dict is covered in get_model or has a dedicated function that calls _create_vision_transformer. Is it just because you didn't test them all or didn't have time to implement everything, or is there another specific reason?

See below (three screenshots of the code attached).

Thanks a lot.

kkoutini commented 3 months ago

Hi, I got the base code from the timm library, which has download links for the different models. I then added the models that I trained one by one in the same fashion, each with a link to download the weights. The missing ones are simply the ones I didn't use. However, I believe it should work if you add more ViTs in the same way.

Antoine101 commented 3 months ago

Hi Khaled,

Thanks for the answer.

Regarding your first reply in this thread, concerning the adaptation/averaging of input channels: why is there a sum over dim=1 in the code instead of a mean? In adapt_input_conv, here.

kkoutini commented 3 months ago

I think you are right, mean should work better.
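A small sketch of the trade-off being discussed (illustrative shapes, not the library's actual code): summing the RGB weights over dim=1 reproduces the original conv output exactly when the mono input equals each replicated RGB channel, while averaging divides the response by 3 but keeps the weight magnitudes closer to a conv trained on single-channel data.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
rgb_w = torch.randn(8, 3, 16, 16)          # RGB-pretrained conv weight
gray = torch.randn(1, 1, 64, 64)           # mono (spectrogram-like) input
rgb_in = gray.repeat(1, 3, 1, 1)           # R = G = B = gray

out_rgb = F.conv2d(rgb_in, rgb_w, stride=16)
out_sum = F.conv2d(gray, rgb_w.sum(dim=1, keepdim=True), stride=16)
out_mean = F.conv2d(gray, rgb_w.mean(dim=1, keepdim=True), stride=16)

# Sum preserves the pretrained output exactly on replicated-channel input;
# mean scales it down by a factor of 3.
assert torch.allclose(out_rgb, out_sum, atol=1e-5)
assert torch.allclose(out_sum, 3 * out_mean, atol=1e-5)
```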

Antoine101 commented 3 months ago

Great, thanks for the confirmation!