Closed — Ziyan0829 closed this issue 2 years ago
The training loss you're getting on your dataset seems higher than what we get out of LJSpeech. Here are some questions that might help get us to better results:
1. There are no large chunks of silence at the beginning/end of utterances.
2. My dataset is 2-4 s per utterance.
3. I used the fast sampling procedure at inference.

I'm sorry, I tried to upload the samples but the upload failed. The output sounds like Gaussian noise: it's clearly audible at the beginning, then gets weaker, and eventually the sound becomes unstable (some jitter).
@sharvil Thank you for the excellent work. I have a similar issue with residual Gaussian noise in the generated speech files, even without fast sampling. One example is attached. The dataset is similar to the one used by @Ziyan0829; however, there are a few frames of pure zeros at both the beginning and end of each speech file. Might this be the reason? sample.zip
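One cheap way to test that hypothesis is to strip the zero frames before training. A minimal sketch with NumPy (np.trim_zeros only removes exact zeros; real recorded silence with noise floor would need an energy-based trim such as librosa.effects.trim instead):

```python
import numpy as np

# Waveform with a few frames of exact zeros padding both ends.
wav = np.array([0.0, 0.0, 0.1, -0.2, 0.3, 0.0, 0.0])

# np.trim_zeros removes leading and trailing zeros only;
# zeros in the interior of the signal are left untouched.
trimmed = np.trim_zeros(wav)
print(trimmed)  # [ 0.1 -0.2  0.3]
```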
I listened to the sample. My best assessment is that the default noise schedule doesn't work well with your dataset. You may need to add more denoising steps and reduce the noise variance at the lower endpoint (i.e. reduce 1e-4 in the noise schedule).
Thank you for the reply. Hmm, that makes sense. I'll try a noise schedule with the 1e-4 lower endpoint reduced.
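For reference, assuming a DiffWave-style linear schedule (the default of 50 steps from 1e-4 to 0.05 is an assumption about this repo's config, not confirmed here), the suggested change would look something like:

```python
import numpy as np

# Assumed default: 50 denoising steps, lower endpoint 1e-4.
default_schedule = np.linspace(1e-4, 0.05, 50)

# Suggested change: more denoising steps and a smaller lower
# endpoint, which reduces residual noise in quiet regions at
# the cost of slower sampling.
modified_schedule = np.linspace(1e-5, 0.05, 200)
```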
Hi, thanks for your great work. I trained the model on a single-speaker dataset of 10,000 utterances; the loss is shown in the figure. At inference, the audio has some clearly audible noise. Is the dataset too small, or could there be other reasons?