Open Antoine101 opened 7 months ago
Up @kkoutini Not sure if you saw this.
How should ImageNet normalization statistics be cascaded down to MelSpec 1 channel inputs for downstream finetuning? Where is this applied in the code?
Many thanks
Hi Antoine, I'm sorry I missed this issue. The normalization is applied (hard-coded) here. I think the stats were calculated based on a subset of Audioset. In my runs, I used the same spectrogram preprocessor for all datasets when fine-tuning.
Thanks for getting back to me Khaled!
Ok I see!
So the first training is done on ImageNet with ImageNet statistics, then the model pretrained on ImageNet is finetuned on Audioset, using Audioset statistics, correct? So if I later finetune the model already finetuned on Audioset on another dataset I should use mean=4.5 and std=5.
Are the two statistics (ImageNet's and Audioset's) not related in any way? Shouldn't ImageNet statistics have been propagated all the way down, aggregated from 3 channels to 1?
Finally, I see that you normalize after applying masks. Is this the correct way to do it?
I noted in your paper the following augmentations:
I struggle to understand the order in which everything goes. I see _mymixup after _melforward in ex_esc50.py although it is said in your paper that waveforms are mixed.
I would expect the following steps: waveform loading -> waveform mixup -> mel (feature computation) -> augmentations
How is it really?
Many thanks
Yes, I think you can keep the same values, mean=4.5 and std=5, if you're using the same spectrogram module.
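A minimal sketch of reusing the same fixed statistics on any downstream dataset, assuming the usual `(x - mean) / std` convention. The helper name `normalize` and the constants are hypothetical and only mirror the values quoted in this thread. Log-mel values are often negative, so the sign of the offset should be checked against the hard-coded line in the repository.

```python
# Values quoted in the thread; the (x - mean) / std convention is an assumption.
AUDIOSET_MEAN = 4.5
AUDIOSET_STD = 5.0


def normalize(spec, mean=AUDIOSET_MEAN, std=AUDIOSET_STD):
    """Standardize a spectrogram given as a list of rows (mel bins x frames)."""
    return [[(value - mean) / std for value in row] for row in spec]


# Toy 2x2 "log-mel" input, used the same way for every fine-tuning dataset.
log_mel = [[4.5, 9.5], [-0.5, 4.5]]
normalized = normalize(log_mel)
```

Because the statistics are constants rather than values recomputed per dataset, the preprocessing stays identical between the Audioset stage and any later fine-tuning stage.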
> Finally, I see that you normalize after applying masks. Is this the correct way to do it?
Ah, I guess you may get improvements if you do the masking after normalizing.
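One possible reason the order matters, sketched under assumptions: if masked bins are filled with 0, then masking before normalization shifts the fill value away from 0, while masking after normalization leaves it at 0, the center of the normalized range. The helpers `normalize` and `mask_rows` are hypothetical and are not taken from the repository. The mean and std are the values quoted in this thread, with an assumed `(x - mean) / std` convention.

```python
def normalize(spec, mean=4.5, std=5.0):
    """Standardize every value; mean/std are the values quoted in the thread."""
    return [[(value - mean) / std for value in row] for row in spec]


def mask_rows(spec, start, width, fill=0.0):
    """Frequency-mask style augmentation: overwrite rows [start, start + width)."""
    return [
        [fill] * len(row) if start <= index < start + width else list(row)
        for index, row in enumerate(spec)
    ]


spec = [[4.5, 9.5], [-0.5, 4.5], [2.0, 7.0]]

# Order A: mask first, then normalize. The fill value is shifted to (0 - 4.5) / 5.
mask_then_norm = normalize(mask_rows(spec, start=1, width=1))

# Order B: normalize first, then mask. The fill value stays at 0.
norm_then_mask = mask_rows(normalize(spec), start=1, width=1)
```

Unmasked rows are identical in both orders; only the value the model sees in the masked region differs.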
Yes, the correct order is: waveform loading -> waveform mixup -> mel (feature computation) -> augmentations. The waveform mixing is done first, in the dataset (here), together with the waveform augmentations.
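The order described above can be sketched in plain Python. This is only an illustration of the sequencing: `log_energy_features` is a stand-in for the real mel front end, and all function names and parameters here are hypothetical rather than the repository's API.

```python
import math


def mixup_waveforms(wave_a, wave_b, lam):
    """Step 2: mix two raw waveforms sample by sample with weight lam."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(wave_a, wave_b)]


def log_energy_features(wave, frame_len=4, eps=1e-5):
    """Step 3 stand-in: log energy per non-overlapping frame (not a real mel)."""
    frames = [
        wave[i:i + frame_len]
        for i in range(0, len(wave) - frame_len + 1, frame_len)
    ]
    return [math.log(sum(s * s for s in frame) / frame_len + eps) for frame in frames]


def time_mask(features, start, width, fill=0.0):
    """Step 4: spectrogram-level augmentation applied after feature computation."""
    return [
        fill if start <= index < start + width else value
        for index, value in enumerate(features)
    ]


def pipeline(wave_a, wave_b, lam, mask_start, mask_width):
    mixed = mixup_waveforms(wave_a, wave_b, lam)        # waveform mixup
    features = log_energy_features(mixed)               # feature computation
    return time_mask(features, mask_start, mask_width)  # augmentations


# Step 1 (loading) is replaced by two toy waveforms.
features = pipeline([1.0] * 8, [0.0] * 8, lam=0.5, mask_start=0, mask_width=1)
```

Keeping the mixup inside the dataset stage means the feature extractor only ever sees a single, already mixed waveform.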
Hi Khaled,
Could you please point me to where normalization is applied to inputs? (for the esc50 case or any other cases)
I am talking about channels mean and std such as written in the code below:
If the first training was done on ImageNet, then I guess the ImageNet channel means and stds are applied to Audioset inputs when finetuning on that dataset, and also to ESC50 inputs if further finetuning on that one. Am I correct?
Again, I am trying to refactor your code so that only the portion relevant to us fits into our existing training scripts. But I don't see where those means and standard deviations are applied, whether in the dataset or in AugmentMel.
Thanks a lot (again)
Antoine