ivanvovk / WaveGrad

Implementation of WaveGrad high-fidelity vocoder from Google Brain in PyTorch.
BSD 3-Clause "New" or "Revised" License

schedules model for other dataset and different sample rate #16

Closed Liujingxiu23 closed 3 years ago

Liujingxiu23 commented 3 years ago

I don't fully understand the noise schedules. Are the models in schedules/pretrained suitable for other datasets, e.g. 22k and 16k?

I tried to train on my own dataset, whose sample rate is 16000, and used the pretrained schedule models (16, 25 and 100 iters); the predicted results sound good, especially with 100 iters.

But I don't understand why the schedule models can also be used for a 16k sample rate. Or, even though the synthesized wavs sound good, is it not the correct way?

ivanvovk commented 3 years ago

Hi @Liujingxiu23.

I think a properly found noise schedule is significantly dataset-dependent only at extremely small numbers of iterations.

The authors note in the paper that, to get good audio reconstruction quality with fewer than 1000 iterations, you should start the noise schedule with small beta values, since those make the most impact on removing static noise. With that in mind, to extract the "pretrained" 12-, 25-, 50- and 100-iteration schemes I used an exponential-type approach (see the 25-iters graph I attached, and the sketch below). Since during training you always set the constant schedule to linspace(1e-6, 1e-2, steps=1000), it doesn't matter what type of data you train on: the lower-iteration denoising trajectory would always be more or less the same, thanks to the strong conditioning on the mel-spectrogram and the direct noise level.
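For illustration, a minimal sketch of both schedules. The `exponential_schedule` helper and its log-spacing are hypothetical: an assumption about what "exponential-type" means here, not the exact procedure used for the pretrained schemes.

```python
import math
import torch

# Constant training schedule used by this repo: 1000 linear betas.
train_betas = torch.linspace(1e-6, 1e-2, steps=1000)

def exponential_schedule(n_iter, beta_start=1e-6, beta_end=1e-2):
    # Hypothetical helper: log-spaced betas, so the schedule starts with
    # very small values, as the paper recommends for few-iteration sampling.
    return torch.logspace(math.log10(beta_start), math.log10(beta_end), steps=n_iter)

betas_25 = exponential_schedule(25)
print(betas_25[0], betas_25[-1])  # tiny first beta, beta_end as the last
```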

Of course, with 6 iterations I assume it wouldn't work so well on a new dataset, since 6 points is a very small number for reconstructing the right trajectory.

This is my view.

Liujingxiu23 commented 3 years ago

@ivanvovk Thank you very much for your reply. The following is my understanding; could you help me point out whether it is right or not?

1. The noise schedule is designed, not trained, and its trajectory is more or less the same across iteration counts. If I want to get a good result using iters=6 on my own database, I should try different beta settings that have a similar trajectory to the other iteration schemes.
2. In the inference stage, the iteration goes from 25, 24, 23, ... down to 1; the noise level is 0.3096, 0.5343, 0.6959, ..., 1.0, and the corresponding beta values are 0.66428, 0.41055, 0.25373, ..., 0.000007 (from large to small). Does that mean that, going from pure Gaussian noise to a coarse waveform and then to a refined waveform, the corresponding noise goes from large to small, following the beta values?

There is another question about the generation of waveforms: in the paper, once y_recon is obtained, y_t is computed as:

[screenshot of the sampling equation from the paper]

In the code version of lmnt-com, the related code is simple:

```python
sigma = ((1.0 - alpha_cum[n-1]) / (1.0 - alpha_cum[n]) * beta[n])**0.5
audio += sigma * noise
```

But in this version, not only the log-variance but also the mean value is computed and used to get the new y_t:

```python
model_mean, model_log_variance = self.p_mean_variance(mels, y, t, clip_denoised)
eps = torch.randn_like(y) if t > 0 else torch.zeros_like(y)
return model_mean + eps * (0.5 * model_log_variance).exp()
```

I do not understand what the mean and log-variance are, or why you compute y_t in this way.

ivanvovk commented 3 years ago

@Liujingxiu23 Answering your questions:

  1. Once again, the noise schedule (the betas) is set to a constant 1000 values from 1e-6 to 1e-2 (in diffusion order). It sets the noise levels with which we destroy the original data distribution during the forward diffusion process. At training time, conditioned on the mel-spectrogram and these noise levels, the model learns the perturbations made to the data point by approximating the exact injected noise. During inference we want to retrace the reverse trajectory of that destruction. Basically, the trajectory variance is linear (during training we set the betas to linspace(1e-6, 1e-2, 1000)). In practice, when constructing lower-iteration schemes, it turns out that for good perceptual quality of the restored waveform the diffusion noise schedule should start with small betas (which is why I built the schedules in an exponential way). Even so, 6 iterations is a very small number for reconstructing the fine-grained structure of a waveform; however, the authors show that if you run a grid search, you may find a suitable schedule. A sketch of the forward process appears after this list.

  2. I didn't quite get this question, but regarding the ordering: the betas are ordered in ascending manner for the diffusion process (descending for generation). The alphas are computed as 1 - betas, so they are in descending order for diffusion (ascending for generation). Noise levels are computed as sqrt(alphas_cumprod); the cumulative product doesn't change the order, since the alphas lie in [0, 1], and sqrt doesn't affect order either, so the noise levels, like the alphas, are in descending order for diffusion (ascending for generation). These orderings can be checked numerically, as in the second sketch after this list.

  3. To reconstruct the reverse denoising process, you need to know the Gaussian transitions of the diffusion process: their mean and variance. Thanks to the model architecture you can do that analytically by estimating the denoising posteriors q(y_{t-1} | y_t, y_0). The code base of the lmnt-com guys is equivalent: they wrote it as a single formula, as the paper suggests, but that code syntax loses the probabilistic logic (the third sketch after this list demonstrates the equivalence). See the original DDPM paper and the issue created by the main WaveGrad author, Nanxin Chen.
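To make point 1 concrete, here is a minimal sketch of the forward diffusion step under the constant training schedule. The `diffuse` helper is hypothetical and uses a discrete step index, whereas WaveGrad actually conditions on a continuous noise level; the destruction formula itself is the standard one.

```python
import torch

betas = torch.linspace(1e-6, 1e-2, steps=1000)   # constant training schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffuse(y0, t):
    # Forward diffusion q(y_t | y_0): destroy the clean waveform y0 with the
    # t-th cumulative noise level. During training the model sees y_t (plus
    # the mel-spectrogram and the noise level) and learns to predict eps.
    eps = torch.randn_like(y0)
    y_t = alphas_cumprod[t].sqrt() * y0 + (1.0 - alphas_cumprod[t]).sqrt() * eps
    return y_t, eps
```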
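The orderings from point 2 can be verified numerically with a few lines:

```python
import torch

betas = torch.linspace(1e-6, 1e-2, steps=1000)      # ascending for diffusion
alphas = 1.0 - betas                                # hence descending
noise_levels = torch.cumprod(alphas, dim=0).sqrt()  # also descending

assert torch.all(betas[1:] >= betas[:-1])
assert torch.all(alphas[1:] <= alphas[:-1])
assert torch.all(noise_levels[1:] <= noise_levels[:-1])
```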
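And for point 3, a sketch checking that the two parameterizations agree: the posterior-mean form (as in this repo, but without the optional clipping of the predicted y_0) versus the single-formula form from the paper that lmnt-com implements. All tensors are stand-ins.

```python
import torch

torch.manual_seed(0)
betas = torch.linspace(1e-6, 1e-2, steps=1000)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

t = 500               # arbitrary interior step
y_t = torch.randn(8)  # stand-in for the current noisy waveform
eps = torch.randn(8)  # stand-in for the model's noise prediction

# Posterior form: estimate y_0, then take the mean of q(y_{t-1} | y_t, y_0).
y0_hat = (y_t - (1 - alphas_cumprod[t]).sqrt() * eps) / alphas_cumprod[t].sqrt()
mean_posterior = (
    alphas_cumprod[t - 1].sqrt() * betas[t] / (1 - alphas_cumprod[t]) * y0_hat
    + alphas[t].sqrt() * (1 - alphas_cumprod[t - 1]) / (1 - alphas_cumprod[t]) * y_t
)

# Single-formula form, as in the lmnt-com code.
mean_formula = (y_t - betas[t] / (1 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()

# The variance sigma^2 = (1 - alpha_cum[t-1]) / (1 - alpha_cum[t]) * beta[t]
# is literally the same expression in both code bases.
print(torch.allclose(mean_posterior, mean_formula, atol=1e-5))  # True
```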

ivanvovk commented 3 years ago

Closing the issue due to inactivity. Feel free to open a new issue or reopen this one if you have more questions.