lmnt-com / diffwave

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer.

Adopting diffusion models for TTS #14

Closed keonlee9420 closed 3 years ago

keonlee9420 commented 3 years ago

Hi all, I'm currently playing with DiffSinger, a TTS system extended with diffusion models. The naive version consists of encoders (which embed the text and pitch information) and a denoiser conditioned on the encoders' output. Everything is similar to diffwave, including the denoiser's structure and prediction target, except that the noise-prediction network changes from DiffWave's epsilon(noisy_audio, upsampled_spectrogram, diffusion_step) to epsilon(noisy_spectrogram, encoder_outputs, diffusion_step). While the encoders train successfully, I ran into an issue when training the denoiser. I used LJSpeech.
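To make the comparison concrete, here is a rough sketch of the two setups; the tensor shapes, sizes, and model names below are my own assumptions, not taken from either codebase:

```python
import torch

# Illustrative tensors only; shapes and names are assumptions.
batch, mel_bins, frames, hop = 4, 80, 200, 256

# DiffWave (vocoder): denoise raw audio, conditioned on an upsampled mel.
noisy_audio = torch.randn(batch, frames * hop)
upsampled_spectrogram = torch.randn(batch, mel_bins, frames * hop)
# eps_hat = diffwave_model(noisy_audio, upsampled_spectrogram, diffusion_step)

# DiffSinger-style acoustic model: denoise the mel itself, conditioned on
# the text/pitch encoder outputs.
noisy_spectrogram = torch.randn(batch, mel_bins, frames)
encoder_outputs = torch.randn(batch, 256, frames)  # hidden size assumed
# eps_hat = denoiser(noisy_spectrogram, encoder_outputs, diffusion_step)
```

Here is what I did: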

  1. First of all, as a preliminary experiment, I checked that all modules work by setting the denoiser to epsilon(noisy_spectrogram, clean_spectrogram, diffusion_step), i.e., denoising the noisy_spectrogram with the clean spectrogram itself as the conditioner.
  2. After that model converged, I went back to the denoiser epsilon(noisy_spectrogram, encoder_outputs, diffusion_step) to recover the clean_spectrogram. I detached encoder_outputs from autograd before feeding it to the denoiser (to keep the encoders from updating), fixing the conditioner so the denoiser could converge; see the sketch after this list. The model broke when I didn't detach (i.e., when the encoders were allowed to update during denoiser training).
  3. I found that the smaller the range of the conditioner (encoder_outputs) values, the clearer the evidence of successful training.
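A minimal sketch of the detached training step from item 2, assuming DiffWave-style epsilon prediction (all names here, like encoder, denoiser, and alpha_bar, are mine, not DiffSinger's actual code):

```python
import torch
import torch.nn.functional as F

def denoiser_step(encoder, denoiser, text, pitch, clean_spectrogram, alpha_bar, t):
    """One training step with a frozen conditioner (hypothetical names)."""
    # Detach so the denoiser loss sends no gradient back into the encoders.
    encoder_outputs = encoder(text, pitch).detach()

    # Forward diffusion: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    noise = torch.randn_like(clean_spectrogram)
    abar = alpha_bar[t]
    noisy_spectrogram = abar.sqrt() * clean_spectrogram + (1.0 - abar).sqrt() * noise

    # The denoiser is trained to recover the injected noise.
    eps_hat = denoiser(noisy_spectrogram, encoder_outputs, t)
    return F.l1_loss(eps_hat, noise)  # .backward() updates the denoiser only
```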

Below are the results I've got so far. In each image, the upper plot is the sampled (synthesized) mel-spectrogram and the lower one is the ground truth.

  1. The model converges in the preliminary experiment: image
  2. When the encoder's output is fed directly into the denoiser (value range: -9.xxx to 6.xxx): image
  3. When the encoder's output is multiplied by 0.01 to shrink its range: image

Case 2 shows no clue of training at all. By contrast, case 3 shows 'some' level of training, but not what we expected. I double-checked the inference (reverse) part; it is exactly the same as in case 1 and in diffwave.

So I'd like to know whether you have any idea what conditions the denoiser's input conditioner must satisfy for training to succeed. Why does the model produce such unsatisfying results? Am I missing some processing step for the conditioner?

I'd appreciate any suggestions or shared experience. Thanks in advance.

sharvil commented 3 years ago

The diffusion process assumes that the signal is in the range [-1, 1]. I'll note that in the paper you referenced, the authors state:

The mel-spectrograms are linearly scaled to the range [-1, 1], and F0 is normalized to have zero mean and unit variance.

Since your mel spectrograms are in [-9.xxx, 6.xxx], you'll need to shift and scale appropriately. Your third experiment is the right idea, but it makes the mel spectrogram values too small.
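That preprocessing could look roughly like the sketch below; mel_min and mel_max would come from your dataset statistics, and the paper's exact procedure may differ:

```python
import numpy as np

def scale_mel(mel, mel_min, mel_max):
    # Linearly map mel values from [mel_min, mel_max] into [-1, 1].
    return 2.0 * (mel - mel_min) / (mel_max - mel_min) - 1.0

def normalize_f0(f0):
    # Standardize F0 to zero mean and unit variance.
    return (f0 - np.mean(f0)) / (np.std(f0) + 1e-8)
```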

keonlee9420 commented 3 years ago

Thank you for the reply. Actually, I already normalize the mel and F0 as the paper describes. What I mentioned is the encoder's output (the combination of the hidden representations of the text and pitch information), so [-9.xxx, 6.xxx] is the range of this combination (the conditioner for epsilon), not of the spectrogram itself; all mels were already normalized to [-1, 1] throughout the experiments.

Besides, inspired by the range constraint on the mel, I also tried normalizing the encoder's output to [-1, 1] using its min and max over the whole dataset, but it gives the following result, which is even worse than the third experiment: image (torch.tanh() was also tested, and fails)
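Concretely, the two variants I tried look like this (my own sketch; cond_min and cond_max are precomputed over the whole training set):

```python
import torch

def rescale_conditioner(encoder_outputs, cond_min, cond_max):
    # Dataset-level min/max normalization of the conditioner into [-1, 1].
    return 2.0 * (encoder_outputs - cond_min) / (cond_max - cond_min) - 1.0

def squash_conditioner(encoder_outputs):
    # The torch.tanh() variant, which also failed.
    return torch.tanh(encoder_outputs)
```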

I think I have tested every option suggested by your work and the DiffSinger paper but could not get satisfying results. That's why I'm asking whether there are other constraints the 'conditioner' must satisfy (or other configurations to try).

sharvil commented 3 years ago

Ah, sorry for the misunderstanding. You're right – the mel spectrogram should be in [-1, 1], but the conditioner doesn't need that constraint. I'm not really sure why your model is failing to converge.

keonlee9420 commented 3 years ago

Why can DiffWave be trained without a GaussianDiffusion-like module as in Denoising Diffusion Probabilistic Models? It seems that DiffWave corresponds only to the UNet part of that repository.

sharvil commented 3 years ago

I haven't looked through the DDPM codebase very closely, but the core ideas behind the GaussianDiffusion module are present in this repository. For example, the loss function looks similar to the loss computation in this project.

DiffWave uses a WaveNet-like architecture instead of a UNet, but the fundamental ideas behind the DDPM paper and DiffWave are the same: train a neural network to estimate the noise at every step of the diffusion process.
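Schematically, that shared objective looks something like the sketch below (my own names, not a verbatim excerpt from either codebase; the DDPM paper uses an MSE loss where this project uses L1):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, conditioner, alpha_bar):
    """Epsilon-prediction objective shared by DDPM and DiffWave (sketch).

    x0 is the clean signal: raw audio for DiffWave, or a mel spectrogram
    for a DiffSinger-style acoustic model.
    """
    # Sample a diffusion step for each example in the batch.
    t = torch.randint(0, len(alpha_bar), (x0.shape[0],), device=x0.device)
    abar = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcastable

    # Forward diffusion: x_t = sqrt(abar) * x_0 + sqrt(1 - abar) * eps
    noise = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise

    eps_hat = model(x_t, conditioner, t)  # estimate the injected noise
    return F.l1_loss(eps_hat, noise)
```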

keonlee9420 commented 3 years ago

I finally succeeded in training! I used a GaussianDiffusion module to make the model converge; this approach is largely based on the original authors' suggestion.

You can find the code and pre-trained model here: https://github.com/keonlee9420/DiffSinger

Thank you.

sharvil commented 3 years ago

Great, I'm glad to hear that. What was the specific issue that caused your model to fail in previous experiments?

keonlee9420 commented 3 years ago

It was mainly because of the detach() call when conditioning the denoiser: the encoders and the denoiser need to be trained jointly. Furthermore, there is no need to constrain the encoder's output.
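Roughly, the fix looks like this (my own names again, not the exact DiffSinger code): let the denoiser loss backpropagate into the encoders and optimize both modules with a single optimizer.

```python
import torch.nn.functional as F

def joint_step(encoder, denoiser, optimizer, text, pitch, noisy_spectrogram, t, noise):
    # 'optimizer' covers both modules, e.g. Adam over the chained
    # parameters of the encoder and the denoiser.
    optimizer.zero_grad()
    encoder_outputs = encoder(text, pitch)  # NOT detached this time
    eps_hat = denoiser(noisy_spectrogram, encoder_outputs, t)
    loss = F.l1_loss(eps_hat, noise)
    loss.backward()  # gradients flow into the encoders and the denoiser
    optimizer.step()
    return loss
```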

It was beneficial for me to drill down into DiffWave first and figure out which factors differentiate the training process, particularly the input variants (raw audio versus mel-spectrograms).

I hope those who study the speech domain can get some insights from my experiments.

sharvil commented 3 years ago

Yeah, that makes sense. Thanks for sharing!