kkoutini / PaSST

Efficient Training of Audio Transformers with Patchout
Apache License 2.0

Where is input normalization applied? #49

Open Antoine101 opened 3 months ago

Antoine101 commented 3 months ago

Hi Khaled,

Could you please point me to where normalization is applied to inputs? (for the esc50 case or any other cases)

I am talking about channels mean and std such as written in the code below:

IMAGENET_DEFAULT_MEAN = (0.485, 0.456, 0.406)
IMAGENET_DEFAULT_STD = (0.229, 0.224, 0.225)
IMAGENET_INCEPTION_MEAN = (0.5, 0.5, 0.5)
IMAGENET_INCEPTION_STD = (0.5, 0.5, 0.5)

def _cfg(url='', **kwargs):
    return {
        'url': url,
        'num_classes': 1000, 'input_size': (3, 224, 224), 'pool_size': None,
        'crop_pct': .9, 'interpolation': 'bicubic', 'fixed_input_size': True,
        'mean': IMAGENET_INCEPTION_MEAN, 'std': IMAGENET_INCEPTION_STD,
        'first_conv': 'patch_embed.proj', 'classifier': 'head',
        **kwargs
    }

If the first training was done on ImageNet, then I guess ImageNet channels mean and std are applied to Audiosets input when finetuning on this dataset, and also to ESC50 inputs if further finetuning on this one. Am I correct?

Again, I am trying to refactor your code to have only the interesting portion for us fit into our already existing training scripts. But I don't see where those means and standard deviations are applied, whether in the dataset or in AugmentMel.

Thanks a lot (again)

Antoine

Antoine101 commented 2 months ago

Up @kkoutini Not sure if you saw this.

How should ImageNet normalization statistics be cascaded down to MelSpec 1 channel inputs for downstream finetuning? Where is this applied in the code?

Many thanks

kkoutini commented 2 months ago

Hi Antoine, I'm sorry I missed this issue. The normalization is applied (hard-coded) here. I think the stats were calculated on a subset of Audioset. In my runs, I used the same spectrogram preprocessor for all datasets when fine-tuning.
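A minimal sketch of that hard-coded step, assuming the constants quoted later in this thread (mean=4.5, std=5, estimated on an Audioset subset). This is illustrative, not the verbatim repo code:

```python
# Fixed spectrogram normalization stats, as quoted in this thread:
# estimated once on a subset of Audioset and reused unchanged for
# every fine-tuning dataset (ESC50 included).
AUDIOSET_SPEC_MEAN = -4.5  # the thread's "mean=4.5" applied as (x + 4.5)
AUDIOSET_SPEC_STD = 5.0

def normalize_melspec(melspec):
    """Normalize a log-mel spectrogram value (elementwise for arrays/tensors)
    with the fixed Audioset stats, i.e. (x + 4.5) / 5."""
    return (melspec - AUDIOSET_SPEC_MEAN) / AUDIOSET_SPEC_STD
```

Because the stats are baked into the spectrogram preprocessor rather than the dataset, they travel with the model to any downstream fine-tuning run.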

Antoine101 commented 2 months ago

Thanks for getting back to me Khaled!

Ok I see!

So the first training is done on ImageNet with ImageNet statistics, and then the model pretrained on ImageNet is finetuned on Audioset using Audioset statistics, correct? So if I later finetune the Audioset-finetuned model on another dataset, I should use mean=4.5 and std=5?

Are the two statistics (ImageNet's and Audioset's) not related in any way? Shouldn't ImageNet statistics have been propagated all the way down, aggregated from 3 channels to 1?

Finally, I see that you normalize after applying masks. Is this the correct way to do it?

I noted in your paper the following augmentations:

I struggle to understand the order in which everything goes. I see _mymixup after _melforward in ex_esc50.py, although your paper says that waveforms are mixed.

I would expect the following steps: waveform loading->waveforms mixup->mel (feature computation)-> augmentations

How is it really?

Many thanks

kkoutini commented 2 months ago

yes, I think you can keep the same mean=4.5 and std=5 if you're using the same spectrograms module.

Finally, I see that you normalize after applying masks. Is this the correct way to do it?

Ah, I guess you may get improvements if you do the masking after normalizing.
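A minimal sketch of that suggestion, with hypothetical helper names (`normalize`, `mask_bins` are not the repo's API): masking after normalization means masked bins sit at 0 in normalized units, i.e. at the assumed mean, rather than at 0 in raw log-mel units.

```python
def normalize(spec, mean=-4.5, std=5.0):
    """Normalize each bin with the fixed stats quoted in this thread."""
    return [(v - mean) / std for v in spec]

def mask_bins(spec, start, width):
    """Zero out a contiguous band of bins (SpecAugment-style masking)."""
    out = list(spec)
    for k in range(start, min(start + width, len(out))):
        out[k] = 0.0
    return out

# Suggested order: normalize first, then mask, so masked values are 0
# in normalized units instead of 0 in raw log-mel units.
spec = [-4.5, -2.0, 0.5]
masked_after_norm = mask_bins(normalize(spec), start=1, width=1)
```

With the order reversed (mask then normalize), the masked bins would end up at `+4.5/5 = 0.9` after normalization, which is a fairly loud value rather than a neutral one.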

yes, the correct order is loading -> waveforms mixup -> mel (feature computation) -> augmentations. The waveform mixing is done first in the dataset here, together with the waveform augmentations.
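The confirmed order can be sketched as below; every function here is an illustrative placeholder (stand-ins for the dataset's waveform mixup, the mel frontend, and the spectrogram augmentations), not PaSST's actual API:

```python
def mixup_waveforms(w1, w2, lam):
    """Mix two equal-length waveforms in the time domain (done in the dataset)."""
    return [lam * a + (1 - lam) * b for a, b in zip(w1, w2)]

def compute_mel(waveform):
    """Stand-in for the log-mel feature extractor."""
    return [abs(x) for x in waveform]  # placeholder "features"

def augment_spec(melspec):
    """Stand-in for SpecAugment-style masking on the spectrogram."""
    out = list(melspec)
    out[0] = 0.0  # e.g. mask the first bin
    return out

def make_training_sample(w1, w2, lam=0.5):
    # 1) waveforms already loaded
    mixed = mixup_waveforms(w1, w2, lam)  # 2) mixup in the waveform domain
    mel = compute_mel(mixed)              # 3) then feature computation
    return augment_spec(mel)              # 4) then spectrogram augmentations
```

The apparent contradiction in ex_esc50.py (_mymixup appearing after _melforward in the source listing) is just code layout; at runtime the waveform mixing happens inside the dataset, before the mel frontend runs.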