microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

WavLM for Source Separation #708

Closed leolya closed 2 years ago

leolya commented 2 years ago

Thanks for sharing the code. I am trying to use the conformer with WavLM for source separation, and I have some questions about the implementation details.

  1. What is the batch size when using WavLM with the conformer for source separation?

  2. What is the expected range of the input, and should the input be normalized? It seems that the models provided on Huggingface do not use normalized input. However, to evaluate on the LibriCSS dataset, I think normalization is necessary since the recording volume is very low.

Sanyuan-Chen commented 2 years ago

Hi @leolya ,

  1. Each training batch consists of 96 audio chunks, with chunk_size=4s and chunk_hop=2s (see the sketch after this list).
  2. We applied the log spectrogram with utterance-wise mean-variance normalization when fine-tuning the separation model, and I think we haven't released the separation model on Huggingface yet~
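
For clarity, a minimal sketch of that chunking scheme, assuming 16 kHz audio and a torch-based pipeline (the names below are illustrative, not from the released training code):

```python
import torch

SAMPLE_RATE = 16000                  # assumed 16 kHz input
CHUNK_SIZE = 4 * SAMPLE_RATE         # chunk_size = 4 s
CHUNK_HOP = 2 * SAMPLE_RATE          # chunk_hop = 2 s (50% overlap)

def make_chunks(waveform: torch.Tensor) -> torch.Tensor:
    """Split a 1-D waveform into overlapping fixed-size chunks.

    Trailing samples that do not fill a full chunk are dropped
    here for simplicity.
    """
    return waveform.unfold(0, CHUNK_SIZE, CHUNK_HOP)

# Example: a 10-second utterance yields 4 chunks starting at 0s, 2s, 4s, 6s.
chunks = make_chunks(torch.randn(10 * SAMPLE_RATE))
print(chunks.shape)  # torch.Size([4, 64000])
```
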
leolya commented 2 years ago

Thanks for your reply! @Sanyuan-Chen

I meant the pre-trained WavLM models on Huggingface. It seems that normalization is not applied to the input there.

Sanyuan-Chen commented 2 years ago

Yes, during pre-training we didn't apply normalization to the input waveform of the WavLM Base model, but we did apply it to the input waveform of the WavLM Large model.
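
For reference, a minimal sketch of that waveform normalization, assuming simple utterance-level zero-mean, unit-variance scaling (the function name and epsilon are placeholders, not the exact pre-training code):

```python
import torch

def normalize_waveform(waveform: torch.Tensor) -> torch.Tensor:
    """Zero-mean, unit-variance normalization over the whole utterance.

    Per the thread: applied to the input of WavLM Large,
    skipped for WavLM Base.
    """
    return (waveform - waveform.mean()) / (waveform.std() + 1e-7)
```
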

For the separation fine-tuning, as described in our paper, there are two input features to the downstream separation model: the STFT feature and the pre-trained representation extracted from the WavLM model.

For the STFT feature, we "apply the log spectrogram with utterance-wise mean-variance normalization".

For the pre-trained representation, we use the same normalization setting as in the pre-training stage, i.e. normalization is not applied for the WavLM Base model but is applied for the WavLM Large model.
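
A rough sketch of the STFT feature described above, assuming torch and placeholder STFT parameters (the n_fft/hop_length values are not from the paper):

```python
import torch

def log_mvn_spectrogram(waveform: torch.Tensor,
                        n_fft: int = 512,
                        hop_length: int = 256) -> torch.Tensor:
    """Log magnitude spectrogram with utterance-wise mean-variance
    normalization; STFT parameters are illustrative placeholders."""
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=torch.hann_window(n_fft), return_complex=True)
    log_mag = torch.log(spec.abs() + 1e-8)  # (freq, time)
    # Utterance-wise statistics: mean/std over the whole spectrogram.
    return (log_mag - log_mag.mean()) / (log_mag.std() + 1e-8)
```
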

leolya commented 2 years ago

Thanks for replying!

Dorniwang commented 2 years ago


By the way, which Python package do you use to extract the waveform from an audio file?