sp-uhh / sgmse

Score-based Generative Models (Diffusion Models) for Speech Enhancement and Dereverberation
MIT License

Question about performance gap between valid set & test set of VB-DMD dataset #13

Closed · Kuray107 closed this issue 1 year ago

Kuray107 commented 1 year ago

First of all, thank you very much for providing the code with such good quality!

I am currently trying to reproduce the results of the model on the VB-DMD dataset, which I downloaded from the link here. The training set I used is the clean & noisy_trainset_28spk_wav, from which I split off all 468 files of speaker p286 as my validation set. The command I used for training is as follows:

python train.py --base_dir VB-DMD_dataset/ --accelerator gpu --gpus 2 --batch_size 12 --no_wandb --max_epochs 160
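For completeness, here is a minimal sketch of the speaker split described above (hypothetical helper; it assumes a base_dir layout of train/ and valid/ subdirectories, each containing clean/ and noisy/ folders, and the usual <speaker>_<utterance>.wav naming of VB-DMD files):

```python
# Hypothetical helper: move all p286 utterances out of the training split
# into a validation split. The directory layout is an assumption.
import shutil
from pathlib import Path

base = Path("VB-DMD_dataset")
for cond in ("clean", "noisy"):
    src = base / "train" / cond
    dst = base / "valid" / cond
    dst.mkdir(parents=True, exist_ok=True)
    for wav in src.glob("p286_*.wav"):  # VB-DMD files are named <speaker>_<utt>.wav
        shutil.move(str(wav), dst / wav.name)
```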

To my surprise, the results on my validation set are quite poor according to the TensorBoard logs: the PESQ score is about 2.2, and the ESTOI value converges to 0.82. However, after testing the model on the test set, the results are much closer to the paper's: the PESQ score is 2.73 (±0.55), and the STOI score is 0.86 (±0.10). Now here are my questions:

  1. Do you have any clues as to why the model's performance on my validation set is so poor?
  2. Right now the PESQ score I get on the test set is not ideal compared with the paper's result (2.73 vs. 2.93). I know that the effective batch size in my current setting is 24 instead of 32. However, do I need to change other hyperparameters during training as well if I want to reproduce your result? If so, could you give me a simple command showing how to set them?
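For reference, here is a minimal sketch of how PESQ and ESTOI can be computed over paired files (assuming 16 kHz mono WAVs and the pesq / pystoi packages; this is not necessarily the exact evaluation script used here, and the helper name is mine):

```python
# Hypothetical metric loop: wide-band PESQ and ESTOI averaged over a test set.
# Assumes clean and enhanced files share filenames and are 16 kHz mono.
from pathlib import Path

import numpy as np
import soundfile as sf
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

def evaluate(clean_dir, enhanced_dir):
    pesq_scores, estoi_scores = [], []
    for clean_path in sorted(Path(clean_dir).glob("*.wav")):
        ref, fs = sf.read(clean_path)
        deg, _ = sf.read(Path(enhanced_dir) / clean_path.name)
        n = min(len(ref), len(deg))  # guard against small length mismatches
        ref, deg = ref[:n], deg[:n]
        pesq_scores.append(pesq(fs, ref, deg, "wb"))  # 'wb' requires fs == 16000
        estoi_scores.append(stoi(ref, deg, fs, extended=True))
    return np.mean(pesq_scores), np.mean(estoi_scores)
```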

Thank you in advance for your time and help!

julius-richter commented 1 year ago

Hey, thanks for your interest!

  1. This probably depends on the choice of speaker(s) in the validation set. We refer to our baseline DiffuSE and chose speakers p226 and p287 (see here).
  2. No, you do not need to change any other hyperparameters for training. The command is python train.py --base_dir /data/VoiceBank/ --batch_size 8 --gpus 4. Hyperparameters such as spec_factor, spec_abs_exponent, sigma_max, etc. do not need to be specified explicitly, since the values used in the paper are set as the defaults.
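As a hypothetical illustration of that pattern (the flag names match the ones above, but the default values below are placeholders, not taken from the repo):

```python
# Hypothetical illustration: paper values wired in as argparse defaults, so
# omitting a flag automatically reproduces the paper's configuration.
# The default values here are placeholders only.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--spec_factor", type=float, default=0.15)
parser.add_argument("--spec_abs_exponent", type=float, default=0.5)
parser.add_argument("--sigma_max", type=float, default=0.5)

args = parser.parse_args([])  # no flags passed -> defaults are used
print(vars(args))
```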

Hope that helps!

Kuray107 commented 1 year ago

Thanks for the reply! I retrained the model following your instructions but still get a similar result on the test set (PESQ ≈ 2.7). The pre-trained checkpoint you provide indeed achieves a PESQ score of ≈ 2.9, so I suspect the default training setting on my side is not optimal. The GPUs I used for training are A40s, but that shouldn't make such a huge difference. Do you have any suggestions for what else I should check? And, if possible, could you re-train the model with the default settings as well, to confirm that it reproduces the correct result?

julius-richter commented 1 year ago

I compared the released code with the code we used for the pre-trained model checkpoint, and there was indeed a mismatch in one hyperparameter. The pre-trained checkpoint uses centered=True, which should also be the default setting when training SGMSE+. We have updated the code accordingly. Thank you for bringing this issue to our attention and helping us find the bug in the code.
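For context, here is a minimal illustration of what STFT centering changes, assuming centered maps to the center argument of torch.stft (an assumption; check the data module for the actual wiring). The 510/128 STFT settings follow the paper:

```python
# Illustration: center=True pads the signal by n_fft//2 on both sides and
# centers each frame on its time index, so the frame grid (and frame count)
# differs from center=False.
import torch

x = torch.randn(16000)  # 1 s of audio at 16 kHz
window = torch.hann_window(510)

X_centered = torch.stft(x, n_fft=510, hop_length=128, window=window,
                        center=True, return_complex=True)
X_uncentered = torch.stft(x, n_fft=510, hop_length=128, window=window,
                          center=False, return_complex=True)

print(X_centered.shape, X_uncentered.shape)  # centered yields more frames
# A model trained under one convention sees systematically shifted frames
# under the other, which can explain an evaluation mismatch like this one.
```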

We retrained the model with the updated code on VoiceBank-Demand, and the model achieved PESQ: 2.93, ESTOI: 0.86, SI-SDR: 17.4, which is very similar to the values reported in the paper. The small deviation could be due to the stochastic nature of the method and the training procedure.
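For anyone reproducing these numbers: PESQ and ESTOI have reference packages, while SI-SDR is usually implemented directly from its definition. A minimal sketch (hypothetical helper, not from this repo):

```python
# SI-SDR (scale-invariant SDR): project the estimate onto the reference,
# then compare the energy of that target component to the residual, in dB.
import numpy as np

def si_sdr(reference, estimate):
    reference = reference - reference.mean()  # zero-mean, as is conventional
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference                # scaled reference component
    residual = estimate - target              # distortion + noise
    return 10 * np.log10(np.sum(target**2) / np.sum(residual**2))
```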

We encourage you to pull the updated code and start another training. Please let us know if it works properly now.

Kuray107 commented 1 year ago

Hello Julius, thank you for the code update! I've re-run the experiment, and this time the evaluation results are good : ).