Hi and thanks for your message!

In `enhancement.py`, the utterances are processed one by one rather than in batches. The reason for this is that with a large batch size, CUDA memory can quickly become scarce. For batch processing, you have to zero-pad shorter utterances to match the sequence dimension of the longest utterance, which may not be optimal when your test set contains various utterance lengths. Nevertheless, the real time factor with the PC sampler (`N=30`, `corrector_steps=1`) based on the zero-padded utterances should remain approximately the same when using batch processing (see the sketch below). In the paper, we used an NVIDIA GeForce RTX 2080 Ti GPU in a machine with an Intel Core i7-7800X CPU @ 3.50GHz and achieved a real time factor of 1.77, which is similar to the measurement you reported.
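For illustration, here is a minimal sketch of the zero-padding and real-time-factor bookkeeping described above, assuming 16 kHz audio; the `enhanced = batch` line is only a placeholder for the actual sampler call, not the repository's API:

```python
import time

import torch


def pad_batch(utterances):
    """Zero-pad a list of 1-D waveforms to the length of the longest one."""
    max_len = max(u.shape[-1] for u in utterances)
    batch = torch.zeros(len(utterances), max_len)
    for i, u in enumerate(utterances):
        batch[i, : u.shape[-1]] = u
    return batch


sr = 16000  # assumed sampling rate
# Dummy utterances of 3, 7 and 15 seconds standing in for real noisy files.
utterances = [torch.randn(sr * n) for n in (3, 7, 15)]
batch = pad_batch(utterances)  # shape (3, 15 * sr): all items padded to 15 s

start = time.time()
enhanced = batch  # placeholder for the sampler call (e.g. 30 PC steps)
elapsed = time.time() - start

# After padding, every item effectively has the duration of the longest
# utterance, so the real time factor (processing time / audio duration)
# computed on the padded batch stays roughly constant.
rtf = elapsed / (batch.shape[0] * batch.shape[-1] / sr)
print(f"RTF: {rtf:.3f}")
```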
Thank you so much for taking the time to answer. This is very useful. It is reassuring to know things seem to be as expected.

One more question: do you have any guidance on setting the various hyperparameters, e.g.:

- the `snr` parameter of the "ald" corrector
- `lambda` (stiffness)
The hyperparameters were set mostly empirically or by grid search. For detailed information, please check Section IV-D in the journal pre-print. But to answer your specific questions right away (a small sketch of these quantities follows below):

- `sigma_min` was set so that the white Gaussian noise is low enough to be unnoticeable at $t=0$, i.e. in the final clean speech estimate.
- `sigma_max` was set so that the particular characteristics of each noisy speech sample are strongly masked by the white Gaussian noise at $t=1$. We assume that because the clean speech is still recognizable at $t=1$, the reverse process is well guided and therefore does not require as many reverse steps.
- The `snr` parameter, i.e. the step size of the annealed Langevin dynamics, was obtained by grid search (see Fig. 4b). Interestingly, it represents a compromise between PESQ and SI-SDR. We chose `snr=0.5` to achieve a maximum PESQ value while still obtaining a good SI-SDR. We assume that with a larger step size, more white Gaussian noise is subtracted in each sampler step. This in turn ensures that fewer new artifacts are created, but also that less environmental noise is masked and removed in subsequent sampler steps. With `snr=0.5`, we achieve a good balance between the removal of environmental noise and the white Gaussian noise used for generative modeling.
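To make these roles concrete, here is a small sketch of a VE-style geometric noise schedule and an annealed Langevin corrector step in the spirit of Song et al.; the numeric values and the exact update rule are illustrative assumptions rather than the repository's implementation:

```python
import torch

# Illustrative values only; the actual defaults live in the repository's
# SDE configuration and may differ.
sigma_min, sigma_max = 0.05, 0.5


def sigma(t):
    """VE-style geometric noise schedule: close to sigma_min at t=0 (final
    estimate is nearly noise-free) and sigma_max at t=1 (the input's
    particular characteristics are strongly masked)."""
    return sigma_min * (sigma_max / sigma_min) ** t


def ald_corrector_step(x, score, snr=0.5):
    """One annealed Langevin dynamics corrector step in the spirit of
    Song et al.; `snr` scales the step size and thus trades artifact
    creation against residual noise removal."""
    noise = torch.randn_like(x)
    step_size = 2 * (snr * noise.norm() / score.norm()) ** 2
    return x + step_size * score + torch.sqrt(2 * step_size) * noise


# Toy usage: a Gaussian score pulls samples toward zero at noise level sigma(1).
x = torch.randn(1, 256)
x = ald_corrector_step(x, score=-x / sigma(1.0) ** 2)
```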
Thank you for all the precious information!

Hi, thanks a million for sharing the code for this cool work! ❤️

I am trying to use the NCSN++ model (for a slightly different purpose and dataset), and I have the following two questions.
1) The default model size is very large (65M parameters). Since the size was not indicated in the paper, could you please confirm this is what you used? If the size is different, could you please indicate how your model differs from the default one? BTW, do you think that such a large model is necessary?
2) The run time at inference with the PC sampler (`N=30`, `corrector_steps=1`) is about 30 seconds for a batch of 20 samples of at most 15 seconds each. Does that match your experience?

Thanks in advance!! 🤗
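For anyone wanting to verify these numbers themselves, a minimal sketch for counting parameters and timing inference; `ScoreModel` and `enhance` in the usage comments are hypothetical stand-ins for whatever model and sampler call you actually load:

```python
import time

import torch


def count_parameters(model: torch.nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def average_runtime(fn, repeats: int = 3) -> float:
    """Average wall-clock time of `fn()`, synchronizing CUDA if available
    so that asynchronous kernel launches are included in the measurement."""
    times = []
    for _ in range(repeats):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.time()
        fn()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        times.append(time.time() - start)
    return sum(times) / len(times)


# Hypothetical usage with a loaded score model and a prepared batch:
# model = ScoreModel.load_from_checkpoint("checkpoint.ckpt")
# print(f"{count_parameters(model) / 1e6:.1f}M trainable parameters")
# print(f"{average_runtime(lambda: enhance(model, batch)):.1f} s per batch")
```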