Questions on the evaluation on the VB-DMD dataset

KeiKinn commented 2 years ago

Hi, thank you for your excellent work.

I'm trying to reproduce the model's results on VB-DMD. I generated enhanced speech from the audio of speakers p226 and p287, as you mentioned in Issue 13.

The command I used is:

python3 enhancement.py --test_dir 'path_to_test' --enhanced_dir 'path_to_enhanced' --ckpt 'train_vb_29nqe0uh_epoch=115.ckpt'

The test dir contains clean and noisy data from the speakers p226 and p287 without any preprocessing.

The generated speech is quite different from the samples on the demo page: the output has a weird, monster-like voice. I picked a few files at random; please check them here.

Here are my questions:

Am I using the right command and test dataset for evaluation?
Do you have any clues about why the performance on my test set is so bad?
Should I do some preprocessing on the data before I do the evaluation?

Thank you in advance for your time and help!

cobalamin commented 2 years ago

Hi, thank you for your interest!

I think the key problem is a mismatched sampling rate. Our method operates at 16 kHz throughout, and we use a downsampled version of the dataset for this. You get this monster-like voice because enhancement.py reads in your 48 kHz files assuming that they are 16 kHz. This is also consistent with the increased length of the audio files you uploaded, which are all (48/16 =) 3 times as long as the original 48 kHz files.
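To make the arithmetic concrete, here is a small sketch (not code from this repository) of why samples interpreted at the wrong rate appear three times longer. It also shows one way to check a file's rate before running enhancement.py, using Python's standard `wave` module. The helper names `apparent_duration` and `wav_sample_rate` are made up for illustration, and the header check assumes plain PCM WAV files.

```python
import io
import struct
import wave

EXPECTED_SR = 16_000  # rate the pretrained models assume, per the comment above


def apparent_duration(n_samples: int, assumed_sr: int) -> float:
    """Duration in seconds when n_samples are interpreted at assumed_sr."""
    return n_samples / assumed_sr


def wav_sample_rate(path_or_file) -> int:
    """Read the sample rate from a PCM WAV header (hypothetical helper)."""
    with wave.open(path_or_file, "rb") as wav:
        return wav.getframerate()


# A 2-second clip recorded at 48 kHz holds 96000 samples.
n_samples = 2 * 48_000
true_len = apparent_duration(n_samples, 48_000)        # correct rate
wrong_len = apparent_duration(n_samples, EXPECTED_SR)  # rate wrongly assumed

# Build a tiny in-memory 48 kHz WAV to demonstrate the header check.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)       # mono
    wav.setsampwidth(2)       # 16-bit PCM
    wav.setframerate(48_000)
    wav.writeframes(struct.pack("<4h", 0, 1, 2, 3))
buf.seek(0)

# True here: the file is 48 kHz, so it must be downsampled first.
needs_resampling = wav_sample_rate(buf) != EXPECTED_SR
```

Running a check like `wav_sample_rate` over the test directory before enhancement would flag any file that is not at 16 kHz.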

If you want to work with our pretrained models, you should downsample any input audio to the assumed 16 kHz. In principle our method can be used with higher sampling rates, but we have not done so. You'd need to change the places in this codebase where 16 kHz is assumed (including the STFT parameters) and train your own DNN on a suitable dataset.
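For the downsampling step, one possible sketch uses `scipy.signal.resample_poly`, which applies an anti-aliasing filter while changing the rate. This is not necessarily how the downsampled dataset mentioned above was produced. The function names `downsample` and `downsample_file` are hypothetical, and the file wrapper assumes the third-party `soundfile` package is installed.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # rate assumed by the pretrained checkpoints


def downsample(audio: np.ndarray, orig_sr: int, target_sr: int = TARGET_SR) -> np.ndarray:
    """Resample along the time axis (axis 0) with a polyphase filter."""
    if orig_sr == target_sr:
        return audio
    g = gcd(orig_sr, target_sr)
    # 48 kHz -> 16 kHz reduces to up=1, down=3.
    return resample_poly(audio, target_sr // g, orig_sr // g, axis=0)


def downsample_file(src: str, dst: str) -> None:
    """File wrapper (hypothetical); needs the `soundfile` package."""
    import soundfile as sf  # imported lazily so the demo below runs without it

    audio, sr = sf.read(src)
    sf.write(dst, downsample(audio, sr), TARGET_SR)


# Demo: one second of a 440 Hz tone at 48 kHz becomes 16000 samples.
t = np.arange(48_000) / 48_000
tone_48k = np.sin(2 * np.pi * 440 * t)
tone_16k = downsample(tone_48k, 48_000)
```

Applying `downsample_file` to every clean and noisy file in the test directory, and then pointing `--test_dir` at the converted copies, should give input at the rate the checkpoint expects.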

Hope this helps :)

KeiKinn commented 2 years ago

Great, it works, thank you for your help.

Great work!

sp-uhh / sgmse

Questions on the evaluation on the VB-DMD dataset #15