Closed PranjalyaDS closed 10 months ago
Hi @PranjalyaDS
We have seen a similar result with reference audio that contains long silences.
The main reason is the blank token, which is conditioned on this audio.
The LibriTTS dataset contains little silence, so a model trained on LibriTTS generates little silence in the synthesized speech.
But if you train on a dataset with long silences, the model will synthesize speech with random pauses.
In my opinion, there are four possible solutions:
1. Apply silence trimming to the training dataset. (But you should trim the speech carefully for your dataset.)
2. Apply silence trimming to the reference audio.
3. Restrict the maximum duration of the blank token in the durations predicted by the duration predictor.
4. Modify TTV to use an external duration module such as MFA, removing MAS and the blank tokens.
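As a rough illustration of the second and third options, here is a minimal sketch in plain Python. It is not code from the HierSpeech++ repository: the function names, the frame length, the -40 dB threshold, the blank token id of 0, and the cap of 3 frames are all assumptions chosen for the example. In practice an existing utility such as `librosa.effects.trim` could do the trimming step.

```python
import math
from typing import List, Sequence


def trim_silence(samples: Sequence[float],
                 frame_len: int = 256,
                 threshold_db: float = -40.0) -> List[float]:
    """Drop leading and trailing frames whose RMS is below threshold_db.

    Samples are assumed to be floats in [-1.0, 1.0]. Silence in the
    middle of the utterance is left untouched.
    """
    threshold = 10.0 ** (threshold_db / 20.0)  # dB -> linear amplitude
    n_frames = math.ceil(len(samples) / frame_len)
    voiced = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            voiced.append(i)
    if not voiced:
        return []  # the whole signal is below the threshold
    start = voiced[0] * frame_len
    end = min((voiced[-1] + 1) * frame_len, len(samples))
    return list(samples[start:end])


def clamp_blank_durations(token_ids: Sequence[int],
                          durations: Sequence[int],
                          blank_id: int = 0,
                          max_blank_frames: int = 3) -> List[int]:
    """Cap the predicted duration of each blank token.

    blank_id and max_blank_frames are hypothetical values; use the ones
    that match your tokenizer and frame rate.
    """
    if len(token_ids) != len(durations):
        raise ValueError("token_ids and durations must have equal length")
    return [
        min(d, max_blank_frames) if t == blank_id else d
        for t, d in zip(token_ids, durations)
    ]
```

The trimming function only removes silence at the edges of the audio, so long pauses inside a reference clip would still need the duration cap (or manual editing). The cap would be applied to the duration predictor's output at inference time, before the durations are expanded into frames.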
Thanks
Thank you very much @sh-lee-prml. That makes total sense. I will test it out.
Hi, thanks again for open-sourcing the models. I have been training the model on a Hindi dataset, and I have noticed random, abrupt pauses within sentences during generation. I have attached a sample (from the 120k checkpoint). I wanted to know whether you have observed this in your own training, and what could be causing it.
https://github.com/sh-lee-prml/HierSpeechpp/assets/86353867/83753e73-f0da-462e-b49c-6b585c17d35f
[I understand that you may not know Hindi, but I think the pauses at 3 seconds and 8 seconds are quite noticeable.] Thanks again!