Closed lucasjinreal closed 5 years ago
Never tried sorry but it'd be interesting to see.
Chinese also works well with this model. Compared with other Tacotron models, this one produces a clear voice in less time: in my test, with the same dataset, 10,000 steps can synthesize a voice of similar quality to Tacotron at 50,000 steps.
@dvbfuns great to hear that. Do you have any samples to share? It'd be great to put into the main page, if you don't mind.
@dvbfuns Which training dataset are you using? A Chinese TTS version would be a good addition to this great repo.
@erogol I would like to share the samples, but I have problems accessing soundcloud.com. Any suggestions for how to share them? Or I can send them to you by e-mail?
@dvbfuns e-mail would work egolge@mozilla.com . Thanks for your help.
@dvbfuns you might even consider PR your Chinese changes. I agree @dvbfuns, that would be great addition.
@erogol, I have already sent you an e-mail with the model and samples; please take a look.
@erogol Would you like to add them to the README or the model zoo? @dvbfuns BTW, did you use your own labeled dataset?
@jinfagang I can put whatever @dvbfuns can provide. But also understand if he doesn't like to share the model.
@erogol Could u resend the voice samples to me? I'd like to check the performance of Chinese result. jinfagang19@gmail.com , thanks in advance
@jinfagang anything I have will be posted on GitHub as soon as I receive it.
I close this due to inactivity. Feel free to reopen.
@jinfagang @erogol Hi! I'd like to share some Chinese results. You can download demo.zip
And still, `Decoder stopped with 'max_decoder_steps` will sometimes happen when inferring long sentences (>20). I'd be glad to hear if you know a good way to handle it.
@mazzzystar Thanks for sharing your results. They sound to me quite okay but I am not a Chinese speaker.
I'd suggest replacing the stop token layer with an RNN, as it was in the previous versions. The RNN-based model is larger but more reliable. Here is a snapshot:
```python
from torch import nn


class StopNet(nn.Module):
    r"""
    Predicting stop-token in decoder.

    Args:
        r (int): number of output frames of the network.
        memory_dim (int): feature dimension for each output frame.
    """

    def __init__(self, r, memory_dim):
        super(StopNet, self).__init__()
        self.rnn = nn.GRUCell(memory_dim * r, memory_dim * r)
        self.relu = nn.ReLU()
        self.linear = nn.Linear(r * memory_dim, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, inputs, rnn_hidden):
        """
        Args:
            inputs: network output tensor with r x memory_dim feature dimension.
            rnn_hidden: hidden state of the RNN cell.
        """
        rnn_hidden = self.rnn(inputs, rnn_hidden)
        outputs = self.relu(rnn_hidden)
        outputs = self.linear(outputs)
        outputs = self.sigmoid(outputs)
        return outputs, rnn_hidden
```
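To illustrate how a stop-token output relates to the `max_decoder_steps` message mentioned earlier in the thread, here is a minimal, framework-free sketch. The function name `run_decoder`, the `stop_threshold` of 0.5, and the plain list of per-step stop probabilities are assumptions made for illustration; they are not taken from this repo's decoder.

```python
def run_decoder(stop_probs, max_decoder_steps=500, stop_threshold=0.5):
    """Simulate the stopping logic of an autoregressive decoder.

    stop_probs: per-step stop-token probabilities (a hypothetical stand-in
    for what a stop network would output at each decoder step).
    Returns (steps_run, stopped_by_token).
    """
    steps = 0
    for prob in stop_probs:
        # Hard safety limit: give up if the stop token never fires.
        if steps >= max_decoder_steps:
            break
        steps += 1
        # Normal termination: the stop network signals end of speech.
        if prob > stop_threshold:
            return steps, True
    return steps, False


# A sentence where the stop token fires on the third step.
print(run_decoder([0.1, 0.2, 0.9, 0.1]))
# A sentence where it never fires, so the step limit ends decoding.
print(run_decoder([0.1] * 1000, max_decoder_steps=500))
```

If the second case shows up often on long sentences, the stop predictor is probably the weak point rather than the step limit itself, which is the motivation for the more reliable RNN-based stop network.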
@erogol Thanks for your reply, I will try it out. In Chinese, it's really important to know where to pause and for how long within a single sentence. Pauses normally occur several times, and if all of them are correct, the result is considered to have "good naturalness". As far as I know, my model based on mozilla-TTS outperforms most current Mandarin Chinese TTS systems in naturalness. Thanks for your work!
One part I think needs to be improved is that the voice texture is still a little "electronic" and unlike a real human, though it's good enough. I may start to focus on this part and try out some methods, such as a different vocoder or another attention method. BTW, have you considered using `Transformer` to replace the current RNN part? I've noticed that more and more people prefer `Transformer` to RNN since BERT came out.
Finally, thanks again for your great work !
@mazzzystar Thanks for your words :). Yeah, I'd guess things would be much better if we could combine TTS with a neural vocoder. It is in progress, but we need some time to solve some internal technicalities before we continue. You could also try the World vocoder. There is a discussion about it in the issues as well, with some example scripts to help you. It shouldn't be so hard.
I'd say attention is more about getting the pronunciation right, while naturalness is a matter of the vocoder. You can also try the attention windowing implemented in the dev branch, in `layers/attention.py`. It would give more monotonic attention with less noise. Based on the window size you can also roughly define the pace of the speech. You can also try multiplying the attention weights by ~4 before applying normalization. That would also lead to a clearer alignment.
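The two attention suggestions above can be sketched roughly as follows. This is not the code in `layers/attention.py`; it is a pure-Python illustration that assumes softmax as the normalization, and the names `sharpened_attention`, `scale`, `center`, and `window` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sharpened_attention(scores, scale=4.0, center=None, window=None):
    """Scale raw attention scores, optionally mask outside a window, normalize.

    scale:  multiplier applied before normalization (the "~4" suggestion).
    center: encoder index the window is centered on (assumed to be a valid index).
    window: how many positions on each side of `center` remain visible.
    """
    scaled = [s * scale for s in scores]
    if center is not None and window is not None:
        # Positions outside the window get -inf, i.e. zero weight after softmax.
        scaled = [
            s if abs(i - center) <= window else float("-inf")
            for i, s in enumerate(scaled)
        ]
    return softmax(scaled)


raw = [1.0, 2.0, 1.5]
print(softmax(raw))                                  # plain normalization
print(sharpened_attention(raw))                      # more peaked distribution
print(sharpened_attention(raw, center=0, window=1))  # last position masked out
```

Scaling makes the largest score dominate after normalization, which is why the alignment looks cleaner; the window restricts where attention can land at each step, which is what ties the window size to the pace of the speech.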
When it comes to BERT, I've not tried it yet. One problem with BERT is that it requires more memory than an RNN. Training might therefore be difficult on low-budget systems, which I'd prefer to avoid. However, if you'd like to try, I am here to help.
Thanks again!
@mazzzystar Your Chinese result is really impressive! May I ask which Chinese voice corpus you used, or how you organized your data?
@jinfagang Sorry, I can't share the details because it's part of my current work, and doing so may hurt my company's interests. I hope you can understand. I'm here just to let you know that mozilla-TTS works well for Chinese synthesis.
@mazzzystar hello man, the `demo.zip` file doesn't seem to work. How can I download it?
@mazzzystar @jinfagang @dvbfuns @erogol yes, I also tried it on a Chinese corpus. The model gets a better alignment than other Tacotron 2 projects, especially `nvidia/tacotron2`. But I haven't yet listened to the quality of the synthesized voice.
@tsungruihon Which repo are u using?
@jinfagang just use mozilla TTS
@tsungruihon Sorry, I mean, which corpus
@jinfagang audio that post in some app.
Hello @erogol, thanks for your great work! I'm new to the TTS domain and am trying to adapt your repo to a Chinese dataset (10000 sentences, 12 hours). Training is still ongoing but seems promising. I have several doubts when looking into the details and hope that you can give me some advice:
it should learn to stop after enough training and it is more reliable than using eos. You can also try eos , otherwise.
eval or train loss does not exactly show the final performance. The best is to check yourself for the best sounding model.
in my implementation fine-tuning should work flawlessly.
@erogol thanks for the reply. I'm now training without forward attention and the problem in the figure above seems to have disappeared for now; I will wait longer to see what it becomes. As for fine-tuning, unfortunately I don't even get the chance to see a loss spike, because I cannot restore (or continue) training due to the issue that I described here https://github.com/mozilla/TTS/issues/318. Any idea about this? I tried many modifications but none of them worked.
@erogol Hello erogol, thanks for your great work and replies for my questions. I finally succeeded to train a tacotron2 model with a public Chinese dataset, as well as a WaveRNN using predicted mels. The results sound good. I'd like to share some audio samples here in a few days. And following https://github.com/mozilla/TTS/issues/26, I'm now trying to finetune the tacotron2 with 'BN' prenet, the improvement of loss is significant! Nearly the same as the figures you shared. The training is still on going and I will compare the audios created after. Just a small doubt, after finetuning with 'BN' prenet, is it necessary to retrain(or finetune) my WaveRNN model with the new predicted mels? Thanks!
@puppyapple Great to hear that !!
Your question ... if you train wavernn with the final mel specs you are likely to get better results. However, without that it should sound good enough.
@erogol OK. Then I think I will give it a try anyway! 😁
Here are two samples from my Tacotron 2 + WaveRNN using the dev branch of this repo, thanks for your work! The alignment is shown in the figure (forward attention is enabled during inference). It seems the 'target' parameter has a significant impact on voice quality: the audio with target=4000 sounds more 'trembling' than the one with target=22000, which is much 'cleaner'. samples.zip
@puppyapple Amazing, the result is the best I have ever seen on a Chinese dataset. Will you share a branch for this?
@jinfagang Thanks, nothing special has been added. You could check my forked code, which is all from @erogol's work. A few modifications were made to fit the Chinese data (Biaobei 10000).
@puppyapple would you mind sharing your `config.json` file?
@puppyapple On which branch? How do I prepare for training on Biaobei?
@jinfagang @tsungruihon Everything is in the dev branch. For the Biaobei dataset I have not made any extra preparations; I just followed erogol's implementation and got positive results. Still, this public dataset is too small and lacks punctuation symbols in the scripts, so not all synthesized sentences are as natural as those in my samples, and some have bad or wrong pauses at punctuation. In general the results are not bad.
@puppyapple thanks my friend. It seems that you use `Tacotron2` with `location sensitive attention` instead of `forward attention`, according to the `config.json` from your `dev` branch.
@tsungruihon yes and I also finetuned with BN prenet like erogol described in https://github.com/mozilla/TTS/issues/26.
@puppyapple, I got two questions, since I am new to the project:
@shad94
@puppyapple . Thanks my friend.
@puppyapple I see that the audio you provided is 48000 Hz. Is the sample_rate in your config.json 48000? I ask because upsampling (22kHz -> 48kHz) would not add any high-frequency detail.
@WhiteFu Yes, since the Biaobei dataset is 48khz, I just keep it the way as it is, without any upsampling.
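The point about upsampling follows from the Nyquist limit: a signal sampled at a given rate can only represent frequencies up to half that rate, and resampling to a higher rate cannot recover what was never captured. A small sketch of the arithmetic, assuming 22050 Hz for the "22kHz" figure (the function names are illustrative):

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at the given sample rate."""
    return sample_rate_hz / 2


def max_content_after_resample(source_rate_hz, target_rate_hz):
    """Upper bound on real frequency content after resampling.

    Resampling cannot add information, so the bound is the lower of the
    source and target Nyquist frequencies.
    """
    return min(nyquist_hz(source_rate_hz), nyquist_hz(target_rate_hz))


# Audio recorded at 22050 Hz and upsampled to 48000 Hz.
print(max_content_after_resample(22050, 48000))
# Audio recorded and kept at 48000 Hz, as with the dataset discussed here.
print(max_content_after_resample(48000, 48000))
```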
Thank you for your reply. I will check more details in your forked branch :)
@erogol @puppyapple Hi, I am a newbie in this area. I'm trying to use TTS2 to train a Chinese multi-speaker model. Here are my samples. And I have some questions.
`samples/phonemes/120Kstep/`). I do not use any vocoder (GL or WaveRNN). Is this normal? How should I deal with this problem? Using a vocoder, or any other idea?
`synthesis`. The results for the test text during training (`train.py`) are much better than during testing (`Benchmark.ipynb`). The training-time samples are in `samples/without_phonemes(use pinyin)/29037steps` and `samples/without_phonemes(use pinyin)/30886steps`. The testing-time samples are in `samples/without_phonemes(use pinyin)/30000steps`. The format of the file name is `{text}-{speaker id}-{train steps}`.
Thank you very much! :)
@chynphh Since I'm also new to the TTS domain, I can only try to answer your questions from my own point of view, which may not be correct.
@puppyapple thanks for your reply!
After my experiments, using Pinyin is indeed better than phonemes. I trained Tacotron 2 for 240K steps. The results were good but still a bit noisy. Now I'm trying to train a WaveRNN model. I tried to use the mels generated by Tacotron 2, but they do not work with the raw wav files. It seems to be caused by a mismatch between the raw wav files and the mels generated by Tacotron 2 (https://github.com/mozilla/TTS/issues/26#issuecomment-574073157). So I trained WaveRNN with raw wav files and ground-truth mels. So far it hasn't worked after 180K steps. When training WaveRNN with mels from Tacotron 2, which wavs do you use: the ground-truth wav files or the wavs generated by Tacotron 2?
@chynphh mels generated by trained Tacotron2 model as input and ground truth audio files as target. Have you extracted mels using the right config? You could refer to the benchmark notebook in this repo to do that, maybe a few modifications are needed. For https://github.com/mozilla/TTS/issues/26#issuecomment-574073157, maybe try to locate the out of range sample to find out the reason(like 'hop_length' mismatch, etc.)
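One way to locate the out-of-range samples suggested above is to compare each wav length against the length implied by its mel frame count. The sketch below assumes you have already collected `(name, num_wav_samples, num_mel_frames)` tuples yourself; `find_mismatched` and `tolerance_frames` are hypothetical names, and the one-frame tolerance is a guess to allow for padding at the edges.

```python
def find_mismatched(pairs, hop_length, tolerance_frames=1):
    """Return names whose wav length disagrees with their mel frame count.

    pairs: iterable of (name, num_wav_samples, num_mel_frames).
    hop_length: hop size in samples used when the mels were extracted.
    tolerance_frames: allowed slack, in frames, for padding differences.
    """
    bad = []
    for name, num_samples, num_frames in pairs:
        expected_samples = num_frames * hop_length
        if abs(num_samples - expected_samples) > tolerance_frames * hop_length:
            bad.append(name)
    return bad


items = [("a", 25600, 100), ("b", 27500, 100)]
# With hop_length=256, item "a" is consistent and item "b" is not.
print(find_mismatched(items, hop_length=256))
# With hop_length=275, the situation is reversed.
print(find_mismatched(items, hop_length=275))
```

If nearly every file is flagged, the hop length (or sample rate) used for mel extraction most likely differs from the one the vocoder expects; if only a few are flagged, those individual files are worth inspecting.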
@puppyapple Thanks for your suggestions and answers, I will double check my code.
Is there any blog post or write-up about doing TTS on Chinese with this?