The diffusion process assumes that the signal is in the range [-1, 1]. I'll note that in the paper you referenced, the authors state:

The mel-spectrograms are linearly scaled to the range [-1, 1], and F0 is normalized to have zero mean and unit variance.

Since your mel spectrograms are in [-9.xxx, 6.xxx], you'll need to shift and scale appropriately. Your third experiment is the right idea, but it makes the mel spectrogram values too small.
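For what it's worth, here is a minimal sketch of that shift-and-scale, assuming the min/max of the mel values are known (the numbers below are just the rough range mentioned above, not exact statistics):

```python
import torch

def scale_to_unit_range(mel: torch.Tensor, mel_min: float, mel_max: float) -> torch.Tensor:
    # Linearly map values from [mel_min, mel_max] to [-1, 1].
    return 2.0 * (mel - mel_min) / (mel_max - mel_min) - 1.0

# Illustrative usage with the approximate range quoted above:
# mel = scale_to_unit_range(mel, mel_min=-9.0, mel_max=6.0)
```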
Thank you for the reply. Actually, I already normalized the mel and F0 as the paper describes. What I mentioned is the encoder's output (the combination of the hidden representation of the text and the pitch information), so [-9.xxx, 6.xxx] is the range of this combination (the conditioner for epsilon), not of the spectrogram itself (all mels were already normalized to [-1, 1] during the experiments).
Besides, inspired by the range constraint on the mel, I also tried normalizing the encoder's output by its min and max over the whole dataset to map it into [-1, 1], but it produces the following result, which is even worse than the third experiment:
(torch.tanh() was also tested but fails)
I think I have tested every option suggested by your work and the DiffSinger paper, but could not get any satisfying results. That's why I'm asking whether there are other constraints the 'conditioner' must satisfy (or other configurations to try).
Ah, sorry for the misunderstanding. You're right – the mel spectrogram should be in [-1, 1], but the conditioner doesn't need that constraint. I'm not really sure why your model is failing to converge.
Why can DiffWave be trained without a GaussianDiffusion-like module, as in Denoising Diffusion Probabilistic Models? It seems that DiffWave is constructed from only the unet part of the above repository.
I haven't looked through the DDPM codebase very closely, but the core ideas behind the GaussianDiffusion module are present in this repository. For example, the loss function looks similar to the loss computation in this project.
DiffWave uses a WaveNet-like architecture instead of a unet, but the fundamental ideas behind the DDPM paper and DiffWave are similar: train a neural network to estimate the noise for every step of the diffusion process.
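As a sketch of that shared core (hypothetical names, not the actual code of either repository): sample a timestep, corrupt the clean signal with the closed-form forward process, and regress the network output against the injected noise:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, cond, betas):
    # betas: noise schedule of shape (T,); model(x_t, cond, t) is the epsilon network.
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)           # \bar{alpha}_t
    t = torch.randint(0, betas.shape[0], (x0.shape[0],), device=x0.device)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over features
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise       # q(x_t | x_0)
    return F.mse_loss(model(x_t, cond, t), noise)                # learn to predict the noise
```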
I finally succeeded in training! I used the GaussianDiffusion module to make the model converge, and this idea is largely based on the original authors' suggestion.
You can find the code and pre-trained model here: https://github.com/keonlee9420/DiffSinger
Thank you.
Great, I'm glad to hear that. What was the specific issue that caused your model to fail in previous experiments?
It's mainly because of the detach() call when conditioning the denoiser. The model should be trained jointly across both the encoders and the denoiser. Furthermore, there is no need to constrain the output of the encoder.
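To illustrate the fix, here is a rough sketch of one joint training step (hypothetical encoder/denoiser modules; diffusion_loss is the sketch from the earlier comment, not the actual DiffSinger code):

```python
import torch

def joint_training_step(encoder, denoiser, optimizer, text, pitch, mel, betas):
    # Condition on the live encoder output; do NOT call .detach() here,
    # so gradients from the denoiser loss also update the encoders.
    cond = encoder(text, pitch)
    loss = diffusion_loss(denoiser, mel, cond, betas)
    optimizer.zero_grad()
    loss.backward()  # updates encoders and denoiser jointly
    optimizer.step()
    return loss.item()

# The optimizer must own both parameter sets, e.g.:
# optimizer = torch.optim.Adam(
#     itertools.chain(encoder.parameters(), denoiser.parameters()), lr=1e-4)
```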
It was beneficial for me to drill down into DiffWave first and figure out which factors differentiate the training process, particularly the input variants (audio versus mel-spectrograms). I hope that those who study the speech domain can get insights from my experiments.
Yeah, that makes sense. Thanks for sharing!
Hi all, I'm currently playing with DiffSinger, which is a TTS system extended with diffusion models. For the naive version, it consists of encoders (for embedding text and pitch information) and a denoiser, where the encoders' output is used to condition the denoiser. Everything is similar to DiffWave, including the denoiser's structure and prediction, but the neural net that predicts epsilon becomes epsilon(noisy_spectrogram, encoder_outputs, diffusion_step), compared to DiffWave's epsilon(noisy_audio, upsampled_spectrogram, diffusion_step). While I'm successfully training the encoders, I ran into an issue when training the denoiser. I used LJSpeech. Here is what I did (see the sketch after this list):

1. Trained epsilon(noisy_spectrogram, clean_spectrogram, diffusion_step) to predict the noisy_spectrogram.
2. Trained epsilon(noisy_spectrogram, encoder_outputs, diffusion_step) to predict clean_spectrogram.
3. I detached encoder_outputs from autograd when feeding it to the denoiser (to prevent the encoders from updating), fixing the conditioner so the model could converge. The model was broken when I didn't detach (i.e., when I allowed the encoder to be updated during denoiser training).
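Written out, the three conditioning setups look roughly like this (placeholder names and shapes, not the actual DiffSinger API):

```python
import torch

# Stand-in for the epsilon network; the real denoiser is a WaveNet-like stack.
def denoiser(x_t, cond, t):
    return torch.zeros_like(x_t)

B, M, T = 4, 80, 200                            # batch, mel bins, frames (illustrative)
noisy_spectrogram = torch.randn(B, M, T)
clean_spectrogram = torch.randn(B, M, T)
encoder_outputs = torch.randn(B, M, T, requires_grad=True)
diffusion_step = torch.randint(0, 1000, (B,))

# 1. Sanity check: condition on the ground-truth mel.
eps1 = denoiser(noisy_spectrogram, clean_spectrogram, diffusion_step)
# 2. Condition on the encoder output, gradients flowing back into the encoders.
eps2 = denoiser(noisy_spectrogram, encoder_outputs, diffusion_step)
# 3. Same as 2., but with the conditioner detached from autograd.
eps3 = denoiser(noisy_spectrogram, encoder_outputs.detach(), diffusion_step)
```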
Below are the results I've got so far. The upper one is the sampled (synthesized) mel-spectrogram, and the lower one is the ground truth in each image.

Case 2 doesn't show any sign of training. On the contrary, case 3 shows 'some' level of training, but it is not what we expected. I double-checked the inference (reverse) part, but it is exactly the same as in case 1 and in DiffWave.

So I just want to know if you have any idea about what conditions the input conditioner of the denoiser must satisfy to succeed. Why does the model show such an unsatisfying result as above? Am I missing something in processing the conditioner?

I would appreciate any suggestions or shared experience. Thanks in advance.