kkoutini / PaSST

Efficient Training of Audio Transformers with Patchout
Apache License 2.0
287 stars 48 forks

Pre-trained models on ESC-50 #40

Open Antoine101 opened 6 months ago

Antoine101 commented 6 months ago

Hi Khaled,

I want to use the following checkpoints: [screenshot of the ESC-50 checkpoints]

Just to make sure, when you say pre-trained models on ESC-50 in this case, you mean (in chronological order):

  1. Using a model trained on ImageNet
  2. To then train it on Audioset
  3. And later fine-tune it on ESC-50

If so, how can I know which config of default_cfgs in model.py was used for these checkpoints above?

Also, have you pre-trained on all ESC-50 folds at once? In classical machine learning, when cross-validating with sklearn's GridSearchCV, the model is ultimately refit on all folds with the best hyper-parameter config found. Shouldn't we do the same in deep learning?

Cheers

Antoine

kkoutini commented 6 months ago

Hi, yes, they are trained exactly ImageNet -> Audioset -> ESC-50. There is a model for each fold: the model with fold1 in its name is trained on all folds except fold 1. I'm not sure I understand your last question completely, but the hyper-parameters used are the same for all folds; the PaSST default config can be found here, and here are some examples of how to run it. One thing to note is that the config of the pre-trained model (specified by arch) should match the config of the model you're trying to fine-tune. For example, you cannot load the PaSST-L arch=passt_l_kd_p16_128_ap47 while using the PaSST-S config, or change the patch size or overlap. In these cases, the weight shapes won't match when loading the pre-trained models.
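To make the last point concrete, here is a minimal sketch in plain PyTorch (not the actual PaSST loading code; the embedding dimension and strides below are just illustrative) of why a patch-size mismatch breaks checkpoint loading:

```python
import torch.nn as nn

# The patch-embedding conv weight has shape [embed_dim, in_chans, patch_h, patch_w],
# so it is tied to the patch size the checkpoint was trained with.
# Numbers here are illustrative, not the exact PaSST values.
embed_p16 = nn.Conv2d(1, 768, kernel_size=16, stride=10)  # layout the checkpoint was saved with
embed_p8 = nn.Conv2d(1, 768, kernel_size=8, stride=10)    # a different patch size at load time

try:
    embed_p8.load_state_dict(embed_p16.state_dict())
except RuntimeError as err:
    # e.g. "size mismatch for weight: copying a param with shape torch.Size([768, 1, 16, 16])
    #       from checkpoint, the shape in current model is torch.Size([768, 1, 8, 8])."
    print(err)
```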

Antoine101 commented 6 months ago

Thank you for getting back to me Khaled!

I was trying to draw a parallel with sklearn's GridSearchCV, which implements cross-validation and has a refit parameter. It basically means that your model is trained and validated across all fold combinations to find the best combination of hyper-parameters, but once it's found, the model is refit with that hyper-parameter config on the whole dataset (all folds merged). So I wondered if the same should be done in deep learning. The more data, the better, so I would assume that cross-validation here just gives you an average performance across all folds, but retraining your model on all folds together at the end would give you even better performance.
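For reference, a minimal sklearn sketch of the refit behaviour I mean (the dataset and parameter grid are just placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# With refit=True (the default), GridSearchCV cross-validates every hyper-parameter
# combination, then refits one final estimator with the best combination on the whole dataset.
X, y = load_iris(return_X_y=True)  # placeholder dataset, just to make the sketch runnable
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5, refit=True)
search.fit(X, y)

print(search.best_params_)     # best hyper-parameters found by cross-validation
print(search.best_estimator_)  # estimator refit on all of X, y with those hyper-parameters
```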

Now the thing is that ESC-50 is a challenge with pre-made folds and no held-out test set. So you wouldn't be able to test your model trained on all folds.

Anyway that's not really related to your framework, I was just curious to know.

kkoutini commented 6 months ago

Hi! Thanks for the explanation. I don't know if there is a best way to do it, since training on all the folds for every hyper-parameter setting can be slow for large models, but of course the results will be less noisy.