lmnt-com / diffwave

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer.
Apache License 2.0

The audio has some noise. #25

Closed: Ziyan0829 closed this issue 2 years ago

Ziyan0829 commented 2 years ago

Hi, thanks for the great work. I trained the model on a single-speaker dataset of 10,000 utterances; the loss is shown in the figure. At inference, the generated audio has clearly audible noise. Is this dataset too small, or are there other reasons? [Image: Capture …]

sharvil commented 2 years ago

The training loss you're getting on your dataset seems higher than what we get out of LJSpeech. Here are some questions that might help get us to better results:

Ziyan0829 commented 2 years ago

1. There are no large chunks of silence at the beginning/end of utterances.
2. My dataset has utterances of 2~4 s each.
3. I used the fast sampling procedure in my inference.

I'm sorry, I tried to upload the samples but the upload failed. The output sounds like Gaussian noise: it is clearly audible at the beginning, then it gets weaker, and eventually the sound gradually becomes unstable (some jitter).

wangfn commented 2 years ago

@sharvil Thank you for the excellent work. I have a similar issue with residual Gaussian noise in the generated speech files, even without fast sampling. One example is attached. The dataset used is similar to the one used by @Ziyan0829; however, there are a few frames of pure zeros at both the beginning and end of each speech file. Might this be the reason? sample.zip
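To check how much zero padding a file actually carries, a minimal sketch along these lines may help. It assumes the waveform has already been loaded into a plain list of float samples; `count_edge_zeros` and `trim_edge_zeros` are hypothetical helper names and are not part of DiffWave. It only measures and removes the padding; it does not establish that the padding causes the noise.

```python
def count_edge_zeros(samples, threshold=0.0):
    """Return (leading, trailing) counts of samples with magnitude <= threshold."""
    n = len(samples)
    leading = 0
    while leading < n and abs(samples[leading]) <= threshold:
        leading += 1
    trailing = 0
    # Stop before overlapping the leading run, so an all-zero file is counted once.
    while trailing < n - leading and abs(samples[n - 1 - trailing]) <= threshold:
        trailing += 1
    return leading, trailing


def trim_edge_zeros(samples, threshold=0.0):
    """Drop the zero runs at both ends; zeros inside the utterance are kept."""
    leading, trailing = count_edge_zeros(samples, threshold)
    return samples[leading:len(samples) - trailing]


# Toy waveform standing in for a loaded speech file (hypothetical values).
waveform = [0.0, 0.0, 0.1, -0.2, 0.0, 0.3, 0.0]
print(count_edge_zeros(waveform))
print(trim_edge_zeros(waveform))
```

Dividing the counts by the sample rate gives the padding duration in seconds, which can be compared against the 2~4 s utterance length.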

sharvil commented 2 years ago

I listened to the sample. My best assessment is that the default noise schedule doesn't work well with your dataset. You may need to add more denoising steps and reduce the noise variance at the lower endpoint (i.e. reduce 1e-4 in the noise schedule).
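As a rough sketch of what such a change could look like, the snippet below builds a linear noise schedule with a smaller lower endpoint and more steps, then compares it to a baseline. Only the 1e-4 lower endpoint comes from this thread; the 0.05 upper endpoint, the 50-step baseline, and the adjusted values (1e-5, 100 steps) are assumptions for illustration, and the helper names are hypothetical. Check your own training parameters before copying any numbers.

```python
from itertools import accumulate
from operator import mul


def linear_beta_schedule(beta_start, beta_end, num_steps):
    """Evenly spaced noise variances from beta_start to beta_end, inclusive."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_signal_level(betas):
    """Running product of (1 - beta): signal fraction left after each step."""
    return list(accumulate((1.0 - b for b in betas), mul))


# Baseline: 1e-4 is from the thread; 0.05 and 50 steps are assumed values.
baseline = linear_beta_schedule(1e-4, 0.05, 50)

# Adjusted: lower first-step variance and more steps (hypothetical values).
adjusted = linear_beta_schedule(1e-5, 0.05, 100)

for name, betas in (("baseline", baseline), ("adjusted", adjusted)):
    final_level = cumulative_signal_level(betas)[-1]
    print(name, len(betas), betas[0], final_level)
```

A changed training schedule requires retraining, and any fast-sampling schedule used at inference would need to be revisited to match it.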

wangfn commented 2 years ago

Thank you for the reply. Hmm, that makes sense. I'll try a noise schedule with the 1e-4 lower endpoint reduced.