Closed leolya closed 2 years ago
Hi @leolya ,
Thanks for your reply! @Sanyuan-Chen
I mean the pre-trained WavLM on Huggingface. It seems like normalization is not applied to the input.
Yes, during pre-training, we didn't apply normalization to the input waveform of the WavLM Base model, but did apply normalization to the input waveform of the WavLM Large model.
For the separation fine-tuning, as introduced in our paper, there are two input features to the separation downstream model, which are the STFT feature and the pre-trained representation extracted from the WavLM model.
For the STFT feature, we "apply the log spectrogram with utterance-wise mean-variance normalization".
For the pre-trained representation, we use the same normalization setting as in the pre-training stage, i.e., normalization is not applied for the WavLM Base model, but is applied for the WavLM Large model.
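For readers trying to reproduce this setup, the feature preparation above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, FFT parameters, and the choice to normalize over all time-frequency bins of the utterance are assumptions.

```python
# Hedged sketch of the two input features described above.
# All names and parameters here are illustrative assumptions.
import numpy as np

def log_stft_feature(wav, n_fft=512, hop=160):
    """Log magnitude spectrogram with utterance-wise mean-variance normalization."""
    # Frame the waveform and apply a Hann window (plain-NumPy STFT).
    window = np.hanning(n_fft)
    frames = [wav[i:i + n_fft] * window
              for i in range(0, len(wav) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=-1))
    log_spec = np.log(spec + 1e-8)
    # Utterance-wise mean-variance normalization (here over all bins;
    # whether it is global or per-frequency is an assumption).
    return (log_spec - log_spec.mean()) / (log_spec.std() + 1e-8)

def wavlm_input(wav, model_size="large"):
    """Waveform fed to WavLM: zero-mean unit-variance for Large, raw for Base."""
    if model_size == "large":
        return (wav - wav.mean()) / (wav.std() + 1e-8)
    return wav
```

The `wavlm_input` behavior mirrors the reply above: normalization only for the Large model, matching its pre-training configuration.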
Thanks for replying!
By the way, which Python package do you use to extract waveforms from audio files?
Thanks for sharing the code. I am trying to use the Conformer with WavLM for source separation, and I have some questions about the implementation details.
I'm wondering what batch size you used when training the WavLM with the Conformer for source separation.
I am also wondering about the expected range of the input: should it be normalized? The models provided on Huggingface don't seem to use normalized input. However, to evaluate on the LibriCSS dataset, I think normalization is necessary since the recorded volume is very low.