sp-uhh / sgmse

Score-based Generative Models (Diffusion Models) for Speech Enhancement and Dereverberation
MIT License

Runtime of inference and model size #11

Closed fakufaku closed 1 year ago

fakufaku commented 1 year ago

Hi, thanks a million for sharing the code for this cool work! ❤️

I am trying to use the NCSN++ model (for a slightly different purpose and dataset), and I have the two following questions.

1) The default model size is very large (65M parameters). Since the size was not indicated in the paper, could you please confirm this is what you use? If the size is different, could you please indicate how your model differs from the default one? BTW, do you think such a large model is necessary?

2) The runtime at inference with the PC sampler (N=30, corrector_steps=1) is about 30 seconds for a batch of 20 samples, each at most 15 seconds long. Does that match your experience?

Thanks in advance!! 🤗

julius-richter commented 1 year ago

Hi and thanks for your message!

  1. Yes, the NCSN++ model we use has a size of 65M parameters. We use the configuration of Song et al. when training on 256 x 256 CelebA-HQ using the continuous VE SDE, which we found to be similar to our setting. It may well be possible to achieve similar performance with a reduced model variant. However, we did not perform an experiment on potential model reduction in our paper.
  2. For inference, we did not use batch processing, but simply processed utterance by utterance, as can be seen in enhancement.py. The reason for this is that with a large batch size, CUDA memory can quickly become scarce. For batch processing, you have to zero-pad shorter utterances to match the sequence dimension of the longest utterance in the batch. This may not be optimal when your test set contains utterances of varied lengths. Nevertheless, the real-time factor with the PC sampler (N=30, corrector_steps=1), computed on the zero-padded utterances, should remain approximately the same when using batch processing. In the paper, we used an NVIDIA GeForce RTX 2080 Ti GPU in a machine with an Intel Core i7-7800X CPU @ 3.50GHz and achieved a real-time factor of 1.77, which is similar to the measurement you reported.
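For reference, the real-time-factor arithmetic and the zero-padding described above can be sketched in a few lines of plain Python. The numbers come from this thread; `pad_batch` is a hypothetical helper for illustration, not part of the sgmse codebase:

```python
# Sketch: real-time factor (RTF) = processing time / audio duration,
# plus the zero-padding needed to batch variable-length utterances.
# pad_batch is a hypothetical illustration, not part of the repo.

def rtf(processing_seconds, audio_seconds):
    """RTF > 1 means processing is slower than real time."""
    return processing_seconds / audio_seconds

def pad_batch(utterances, pad_value=0.0):
    """Zero-pad a list of variable-length sample lists to equal length."""
    max_len = max(len(u) for u in utterances)
    return [u + [pad_value] * (max_len - len(u)) for u in utterances]

# Reported in this thread: ~30 s wall time for a batch of 20 utterances
# of up to 15 s each. A single 15 s utterance processed in ~26.6 s would
# correspond to the RTF of 1.77 reported in the paper:
print(round(rtf(26.6, 15.0), 2))  # 1.77

# All utterances padded to the length of the longest one:
batch = pad_batch([[0.1, 0.2], [0.3], [0.4, 0.5, 0.6]])
print([len(u) for u in batch])  # [3, 3, 3]
```

Note that with padding, the RTF computed on padded lengths can look better than the true per-utterance RTF, since the padded tail inflates the audio-duration denominator.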
fakufaku commented 1 year ago

Thank you so much for taking the time to answer. This is very useful. It is reassuring to know things seem to be as expected.

One more question, do you have any guidance on setting the various hyperparameters, e.g.:

julius-richter commented 1 year ago

The hyperparameters were set mostly empirically or via grid search. For detailed information, please check Section IV-D in the journal pre-print.
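A grid search of the kind mentioned above can be sketched as follows. The parameter names and the `score` function are illustrative placeholders only, not the actual sgmse training setup:

```python
# Sketch of a hyperparameter grid search. The grid values and score()
# are hypothetical; in practice score() would be, e.g., PESQ or SI-SDR
# of the enhanced speech on a validation set.
from itertools import product

grid = {
    "N": [20, 30, 50],             # number of reverse diffusion steps
    "corrector_steps": [0, 1, 2],  # annealed Langevin corrector steps
}

def score(config):
    # Dummy objective that peaks at the configuration used in this thread.
    return -abs(config["N"] - 30) - abs(config["corrector_steps"] - 1)

# Evaluate every combination and keep the best-scoring one.
best = max(
    (dict(zip(grid, values)) for values in product(*grid.values())),
    key=score,
)
print(best)  # {'N': 30, 'corrector_steps': 1}
```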

But to answer your specific questions right away:

fakufaku commented 1 year ago

Thank you for all the precious information!