sp-uhh / sgmse

Score-based Generative Models (Diffusion Models) for Speech Enhancement and Dereverberation
MIT License

Runtime of inference and model size #11

Closed fakufaku closed 1 year ago

fakufaku commented 1 year ago

Hi, thanks a million for sharing the code for this cool work! ❤️

I am trying to use the NCSN++ model (for a slightly different purpose and dataset), and I have the two following questions.

1) The default model size is very large (65M parameters). Since the size was not indicated in the paper, could you please confirm this is what you use? If the size is different, could you please indicate how your model differs from the default one? BTW, do you think such a large model is necessary?

2) The runtime at inference with the PC sampler (N=30, corrector_steps=1) is about 30 seconds for a batch of 20 samples, each at most 15 seconds long. Does that match your experience?

Thanks in advance!! 🤗

julius-richter commented 1 year ago

Hi and thanks for your message!

  1. Yes, the NCSN++ model we use has a size of 65M parameters. We use the configuration of Song et al. when training on 256 x 256 CelebA-HQ using the continuous VE SDE, which we found to be similar to our setting. It may well be possible to achieve similar performance with a reduced model variant. However, we did not perform an experiment on potential model reduction in our paper.
  2. For inference, we did not use batch processing, but simply processed utterance by utterance, as can be seen in enhancement.py. The reason for this is that with a large batch size, CUDA memory can quickly become scarce. For batch processing, you have to zero-pad shorter utterances to match the sequence dimension of the longest utterance in the batch. This may not be optimal when your test set contains utterances of varied lengths. Nevertheless, the real-time factor with the PC sampler (N=30, corrector_steps=1), computed on the zero-padded utterances, should remain approximately the same when using batch processing. In the paper, we used an NVIDIA GeForce RTX 2080 Ti GPU in a machine with an Intel Core i7-7800X CPU @ 3.50GHz and achieved a real-time factor of 1.77, which is similar to the measurement you reported.
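For reference, the real-time-factor arithmetic and the zero-padding described above can be sketched in a few lines of plain Python. The numbers come from this thread; `pad_batch` is a hypothetical helper for illustration, not part of the sgmse codebase:

```python
# Sketch: real-time factor (RTF) = processing time / audio duration,
# plus the zero-padding needed to batch variable-length utterances.
# pad_batch is a hypothetical illustration, not part of the repo.

def rtf(processing_seconds, audio_seconds):
    """RTF > 1 means processing is slower than real time."""
    return processing_seconds / audio_seconds

def pad_batch(utterances, pad_value=0.0):
    """Zero-pad a list of variable-length sample lists to equal length."""
    max_len = max(len(u) for u in utterances)
    return [u + [pad_value] * (max_len - len(u)) for u in utterances]

# Reported in this thread: ~30 s wall time for a batch of 20 utterances
# of up to 15 s each. A single 15 s utterance processed in ~26.6 s would
# correspond to the RTF of 1.77 reported in the paper:
print(round(rtf(26.6, 15.0), 2))  # 1.77

# All utterances padded to the length of the longest one:
batch = pad_batch([[0.1, 0.2], [0.3], [0.4, 0.5, 0.6]])
print([len(u) for u in batch])  # [3, 3, 3]
```

Note that with padding, the RTF computed on padded lengths can look better than the true per-utterance RTF, since the padded tail inflates the audio-duration denominator.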
fakufaku commented 1 year ago

Thank you so much for taking the time to answer. This is very useful. It is reassuring to know things seem to be as expected.

One more question, do you have any guidance on setting the various hyperparameters, e.g.:

julius-richter commented 1 year ago

The hyperparameters were set mostly empirically or via grid search. For detailed information, please check Section IV-D in the journal pre-print.
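A grid search of the kind mentioned above can be sketched as follows. The parameter names and the `score` function are illustrative placeholders only, not the actual sgmse training setup:

```python
# Sketch of a hyperparameter grid search. The grid values and score()
# are hypothetical; in practice score() would be, e.g., PESQ or SI-SDR
# of the enhanced speech on a validation set.
from itertools import product

grid = {
    "N": [20, 30, 50],             # number of reverse diffusion steps
    "corrector_steps": [0, 1, 2],  # annealed Langevin corrector steps
}

def score(config):
    # Dummy objective that peaks at the configuration used in this thread.
    return -abs(config["N"] - 30) - abs(config["corrector_steps"] - 1)

# Evaluate every combination and keep the best-scoring one.
best = max(
    (dict(zip(grid, values)) for values in product(*grid.values())),
    key=score,
)
print(best)  # {'N': 30, 'corrector_steps': 1}
```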

But to answer your specific questions right away:

fakufaku commented 1 year ago

Thank you for all the precious information!