English synthesis is good, how about Chinese?

lucasjinreal commented 6 years ago

Is there any blog post or attempt at doing TTS on Chinese?

erogol commented 6 years ago

Never tried sorry but it'd be interesting to see.

dvbfuns commented 5 years ago

Chinese also works well with this model. Compared with other Tacotron models, this model produces a clear voice in less training time. In my test, with the same dataset, 10,000 steps gave a voice quality similar to Tacotron at 50,000 steps.

erogol commented 5 years ago

@dvbfuns great to hear that. Do you have any samples to share? It'd be great to put into the main page, if you don't mind.

lucasjinreal commented 5 years ago

@dvbfuns Which training dataset are you using? A Chinese version of TTS would be a good enhancement to this great repo.

dvbfuns commented 5 years ago

@erogol I would like to share the samples, but I have a problem accessing soundcloud.com. Any suggestions for how to share them? Or I can send them to you by e-mail?

erogol commented 5 years ago

@dvbfuns e-mail would work egolge@mozilla.com . Thanks for your help.

erogol commented 5 years ago

@dvbfuns you might even consider opening a PR with your Chinese changes. I agree @dvbfuns, that would be a great addition.

dvbfuns commented 5 years ago

@erogol, I have already sent you an e-mail with the model and samples. Please take a look.

lucasjinreal commented 5 years ago

@erogol Would you like to add this to the README or a model zoo? @dvbfuns BTW, did you use your own labeled dataset?

erogol commented 5 years ago

@jinfagang I can put up whatever @dvbfuns can provide. But I also understand if he prefers not to share the model.

lucasjinreal commented 5 years ago

@erogol Could you resend the voice samples to me? I'd like to check the quality of the Chinese results. jinfagang19@gmail.com , thanks in advance

erogol commented 5 years ago

@jinfagang anything I receive will be posted on GitHub as soon as I get it.

erogol commented 5 years ago

I'm closing this due to inactivity. Feel free to reopen.

mazzzystar commented 5 years ago

@jinfagang @erogol Hi! I'd like to share some Chinese results. You can download [demo.zip]()

Also, the "Decoder stopped with 'max_decoder_steps'" warning still sometimes appears when inferring long sentences (>20). I'd be glad to hear it if you know a good way to handle this.

erogol commented 5 years ago

@mazzzystar Thanks for sharing your results. They sound quite okay to me, but I am not a Chinese speaker.

I'd suggest replacing the stop token layer with an RNN, as in the previous versions. The RNN-based model is larger but more reliable. Here is a snapshot:

from torch import nn


class StopNet(nn.Module):
    r"""
    Predicting stop-token in decoder.

    Args:
        r (int): number of output frames of the network.
        memory_dim (int): feature dimension for each output frame.
    """

    def __init__(self, r, memory_dim):
        super(StopNet, self).__init__()
        # Recurrent cell keeps a running state across decoder steps.
        self.rnn = nn.GRUCell(memory_dim * r, memory_dim * r)
        self.relu = nn.ReLU()
        # Project the hidden state to a single stop logit.
        self.linear = nn.Linear(r * memory_dim, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, inputs, rnn_hidden):
        """
        Args:
            inputs: network output tensor with r x memory_dim feature dimension.
            rnn_hidden: hidden state of the RNN cell.
        """
        rnn_hidden = self.rnn(inputs, rnn_hidden)
        outputs = self.relu(rnn_hidden)
        outputs = self.linear(outputs)
        # Sigmoid turns the logit into a stop probability in (0, 1).
        outputs = self.sigmoid(outputs)
        return outputs, rnn_hidden
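
For context, the stop probability is what ends decoding at inference time. When it never crosses the threshold, the loop runs into the step cap, which is the situation behind the 'max_decoder_steps' warning mentioned above. The following is a minimal pure-Python sketch of that control flow. `run_decoder`, `decode_step`, `stop_threshold` and `toy_step` are hypothetical names used for illustration and are not the repo's actual API.

```python
def run_decoder(decode_step, max_decoder_steps=500, stop_threshold=0.5):
    """Collect frames until the stop probability crosses the threshold.

    decode_step(t) is a hypothetical callable returning (frame, stop_prob).
    Returns (frames, stopped_by_token).
    """
    frames = []
    for t in range(max_decoder_steps):
        frame, stop_prob = decode_step(t)
        frames.append(frame)
        if stop_prob > stop_threshold:
            return frames, True
    # The stop token never fired: this is the 'max_decoder_steps' case.
    return frames, False


def toy_step(t, stop_at=30):
    # Toy stand-in for one decoder step: the stop probability jumps at stop_at.
    return [0.0], (0.9 if t >= stop_at else 0.1)


frames, stopped = run_decoder(toy_step)
never_frames, never_stopped = run_decoder(lambda t: ([0.0], 0.1), max_decoder_steps=10)
```

In the second call the stop probability stays low, so the loop only ends at the cap and the output would contain trailing frames after the speech.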

mazzzystar commented 5 years ago

@erogol Thanks for your reply, I will try it out. In Chinese, it's really important to know where to pause within a sentence and for how long. Pauses normally occur several times, and if all of them are correct, the result is considered to have "good naturalness". As far as I know, my model based on mozilla-TTS outperforms most current Mandarin Chinese TTS systems in naturalness. Thanks for your work!

One part I think needs improvement is that the voice texture is still a little "electronic" and unlike a real human, though it's good enough. I may start to focus on this part and try some methods, such as a different vocoder or another attention method. BTW, have you considered using a Transformer to replace the current RNN part? I've noticed that more and more people prefer Transformers to RNNs since BERT came out.

Finally, thanks again for your great work !

erogol commented 5 years ago

@mazzzystar Thanks for your words :). Yeah, I'd guess things would be much better if we could combine TTS with a neural vocoder. It is in progress, but we need some time to solve some internal technicalities before we continue. You could also try the WORLD vocoder. There is a discussion about it in the issues as well, with some example scripts to help you. It shouldn't be so hard.

I'd say attention is more about getting the pronunciation right, while naturalness is a matter of the vocoder. You can also try the attention windowing implemented in the dev branch, in layers/attention.py. It should give more monotonic attention with less noise. Based on the window size, you can also roughly define the pace of the speech. You can also try multiplying the attention weights by ~4 before applying normalization. That should also lead to a clearer alignment.
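
To make the two suggestions concrete, here is a minimal pure-Python sketch under stated assumptions. It masks attention scores outside a window around the previously attended position and scales the remaining scores before normalizing. The softmax normalization, the window sizes and the names `windowed_attention`, `win_back` and `win_front` are assumptions for illustration, not the code in layers/attention.py.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def windowed_attention(scores, prev_pos, win_back=1, win_front=3, scale=4.0):
    """Mask scores outside [prev_pos - win_back, prev_pos + win_front],
    scale the rest, then normalize."""
    masked = []
    for i, s in enumerate(scores):
        inside = prev_pos - win_back <= i <= prev_pos + win_front
        masked.append(s * scale if inside else float("-inf"))
    return softmax(masked)


scores = [0.1, 0.5, 0.6, 0.2, 0.4, 0.9]
sharp = windowed_attention(scores, prev_pos=1, scale=4.0)
flat = windowed_attention(scores, prev_pos=1, scale=1.0)
```

Here position 5 falls outside the window and receives zero weight, and the scaled version concentrates more weight on the best-scoring position than the unscaled one. A narrower `win_front` limits how far attention can jump ahead per step, which is one way the window size could influence pace.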

When it comes to BERT, I've not tried it yet. One problem with BERT is that it requires more memory than an RNN. It might therefore be hard to train on low-budget systems, which I'd prefer to avoid. However, if you'd like to try, I am here to help.

Thanks again!

lucasjinreal commented 5 years ago

@mazzzystar Your Chinese result is really impressive! May I ask which Chinese voice corpus you used? Or how did you organize your data?

mazzzystar commented 5 years ago

@jinfagang Sorry, I can't share the details because it's part of my current work and doing so may hurt my company's interests. I hope you can understand. I'm here just to let you know that mozilla-TTS works well for Chinese synthesis.

OswaldoBornemann commented 5 years ago

@mazzzystar hello, the demo.zip file doesn't seem to work. How can I download it?

OswaldoBornemann commented 5 years ago

@mazzzystar @jinfagang @dvbfuns @erogol yes, I also tried it on a Chinese corpus. The model gets a better alignment than other tacotron2 projects, especially nvidia/tacotron2. But I haven't yet listened to the quality of the synthesized voice.

lucasjinreal commented 5 years ago

@tsungruihon Which repo are you using?

OswaldoBornemann commented 5 years ago

@jinfagang just use mozilla TTS

lucasjinreal commented 5 years ago

@tsungruihon Sorry, I mean, which corpus

OswaldoBornemann commented 5 years ago

@jinfagang audio that was posted in some app.

puppyapple commented 4 years ago

Hello @erogol, thanks for your great work! I'm new to the TTS domain and am trying to adapt your repo to a Chinese dataset (10,000 sentences, 12 hours). Training is still ongoing but seems promising. I have several questions after looking into the details and hope you can give me some advice:

1. I noticed that in character training mode (use_phonemes=false) there is no 'enable_eos_bos' option to append an end token to each sentence, which I have seen in other discussions such as NVIDIA/Tacotron2. The model is simply left to learn stopping through the stopnet. In this case, should I always wait for the stop loss to converge to zero? For now, my alignment always has gaps after the stop point, as shown below, along with the "Decoder stopped with 'max_decoder_steps'" warning, so I assume the model has not learned when to stop. Why not add a stop token here to help?
2. Regarding training time, I saw your shared pretrained LJSpeech models on Google Drive, which you trained for 160k steps with a batch size of 16. Should we use the eval loss to decide when to stop training, or just let training continue as long as the training loss improves (overfitting?)
3. When I tried the NVIDIA/Tacotron2 repo, there was a problem with resuming training (a loss spike after the first step, after which the model effectively starts from scratch), which I found is probably related to the Adam optimizer. Have you ever encountered this issue? Thanks!

erogol commented 4 years ago

1. It should learn to stop after enough training, and that is more reliable than using eos. Otherwise, you can also try eos.
2. Eval or train loss does not exactly reflect the final performance. The best approach is to listen yourself and pick the best-sounding model.
3. In my implementation, fine-tuning should work flawlessly.
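
As an illustration of the eos option discussed above, here is a minimal sketch of wrapping a character sequence in begin and end markers before mapping it to IDs. The symbol table, the marker characters `^` and `~`, and the function name `text_to_ids` are hypothetical and chosen only for this example.

```python
# Hypothetical symbol table: pad, BOS, EOS, then lowercase letters and space.
SYMBOLS = ["_", "^", "~"] + list("abcdefghijklmnopqrstuvwxyz ")
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}


def text_to_ids(text, add_bos_eos=False, bos="^", eos="~"):
    """Map characters to IDs, optionally adding begin/end markers."""
    if add_bos_eos:
        text = bos + text + eos
    # Characters missing from the table are silently dropped in this sketch.
    return [SYMBOL_TO_ID[ch] for ch in text if ch in SYMBOL_TO_ID]


plain = text_to_ids("ni hao")
marked = text_to_ids("ni hao", add_bos_eos=True)
```

With the markers enabled, every training sentence ends with the same ID, which gives the attention a consistent final position to land on. The stopnet still decides when decoding ends.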

puppyapple commented 4 years ago

@erogol thanks for the reply. I'm now training without forward attention, and the problem in the figure above seems to have disappeared for now. I will wait longer to see how it turns out. As for fine-tuning, unfortunately I don't even get the chance to see a loss spike, because I could not launch restore (or continue) training due to the issue I described here: https://github.com/mozilla/TTS/issues/318. Any ideas? I tried many modifications but none of them worked.

puppyapple commented 4 years ago

@erogol Hello erogol, thanks for your great work and for replying to my questions. I finally succeeded in training a tacotron2 model on a public Chinese dataset, as well as a WaveRNN using predicted mels. The results sound good. I'd like to share some audio samples here in a few days. Following https://github.com/mozilla/TTS/issues/26, I'm now fine-tuning the tacotron2 with the 'BN' prenet, and the improvement in loss is significant! It is nearly the same as in the figures you shared. The training is still ongoing and I will compare the generated audio afterwards. Just one small question: after fine-tuning with the 'BN' prenet, is it necessary to retrain (or fine-tune) my WaveRNN model with the new predicted mels? Thanks!

erogol commented 4 years ago

@puppyapple Great to hear that !!

Your question ... if you train wavernn with the final mel specs you are likely to get better results. However, without that it should sound good enough.

puppyapple commented 4 years ago

@erogol OK. Then I think I will give it a try anyway! 😁

puppyapple commented 4 years ago

Here are two samples from my Tacotron 2 + WaveRNN using the dev branch of this repo. Thanks for your work! The alignment is shown in the figure (forward attention is enabled during inference). The 'target' parameter seems to have a significant impact on voice quality: the audio with target=4000 sounds more 'trembling' than the one with target=22000, which is much 'cleaner'. samples.zip
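
One plausible reading of the 'target' effect, offered as an assumption rather than a confirmed explanation: in batched WaveRNN generation, the input is typically split into overlapping segments of length target that are generated in parallel and crossfaded back together. A smaller target then means more seams. The sketch below only computes the segment boundaries, and the function name `fold_bounds` and the example lengths are hypothetical.

```python
def fold_bounds(total_len, target, overlap):
    """Return (start, end) index pairs of overlapping segments.

    Each segment spans target + 2 * overlap samples and consecutive
    segments share `overlap` samples on each side.
    """
    step = target + overlap
    num_folds = max(0, (total_len - overlap) // step)
    if num_folds * step + overlap < total_len:
        num_folds += 1  # the last segment would need padding to full length
    return [(i * step, i * step + target + 2 * overlap) for i in range(num_folds)]


small = fold_bounds(100000, target=4000, overlap=400)
large = fold_bounds(100000, target=22000, overlap=400)
```

For the same 100000-sample utterance, target=4000 yields far more segments, and therefore more crossfade points, than target=22000.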

lucasjinreal commented 4 years ago

@puppyapple Amazing, this is the best result I have ever seen on a Chinese dataset. Will you share a branch for this?

puppyapple commented 4 years ago

@jinfagang Thanks, nothing special has been added. You could check my forked code, which is all from @erogol's work. A few modifications were made to fit the Chinese data (Biaobei 10000).

OswaldoBornemann commented 4 years ago

@puppyapple would you mind sharing your config.json file ?

lucasjinreal commented 4 years ago

@puppyapple On which branch? How do I prepare for training on Biaobei?

puppyapple commented 4 years ago

@jinfagang @tsungruihon Everything is in the dev branch. For the Biaobei dataset I have not done any extra preparation. I just followed erogol's implementation and got positive results. Still, this public dataset is too small and lacks punctuation symbols in the scripts, so not all synthesized sentences are as natural as the ones in my samples, and some also have bad or wrong punctuation. In general the results are not bad.

OswaldoBornemann commented 4 years ago

@puppyapple thanks my friend. According to the config.json in your dev branch, it seems that you use Tacotron2 with location-sensitive attention instead of forward attention.

puppyapple commented 4 years ago

@tsungruihon yes and I also finetuned with BN prenet like erogol described in https://github.com/mozilla/TTS/issues/26.

shad94 commented 4 years ago

@puppyapple, I have two questions, since I am new to the project:

1. Have you changed the content of the files in TTS/tests for Chinese? The same question for TTS/mozilla-us-phonemes.
2. How do I generate the encoder vs. decoder graph? Thank you

puppyapple commented 4 years ago

@shad94

1. I didn't use TTS/tests for testing. I used the benchmark Jupyter notebook in TTS/notebooks with some modifications.
2. It's already implemented by erogol in the logger class.

OswaldoBornemann commented 4 years ago

@puppyapple . Thanks my friend.

WhiteFu commented 4 years ago

@puppyapple I see that the audio you provided is 48000 Hz. Is the sample_rate in your config.json 48000? I ask because upsampling (22 kHz -> 48 kHz) would not add high-frequency detail.

puppyapple commented 4 years ago

@WhiteFu Yes, since the Biaobei dataset is 48 kHz, I just kept it as it is, without any upsampling.
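
A quick way to confirm a dataset's native sample rate before setting sample_rate in config.json is to read it from the WAV header. The sketch below uses only Python's standard `wave` module. The helper names `wav_sample_rate` and `make_silent_wav` are hypothetical, and the in-memory file is generated only so the example is self-contained.

```python
import io
import wave


def wav_sample_rate(source):
    """Return the sample rate stored in a WAV header (path or file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getframerate()


def make_silent_wav(sample_rate, n_frames=480):
    """Build a tiny 16-bit mono WAV in memory, only for this demo."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * n_frames)
    buf.seek(0)
    return buf


rate = wav_sample_rate(make_silent_wav(48000))
```

In practice you would pass a path to one of the dataset's wav files instead of the in-memory buffer.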

WhiteFu commented 4 years ago

Thank you for your reply. I will check more details in your forked branch :)

chynphh commented 4 years ago

@erogol @puppyapple Hi, I am a newbie in this area. I'm trying to use TTS2 to train a Chinese multi-speaker model. Here are my samples. And I have some questions.

1. The generated audio files are understandable but very noisy (the samples are in samples/phonemes/120Kstep/). I did not use any vocoder (GL or WaveRNN). Is this normal? How should I deal with this problem? By using a vocoder, or is there another idea?
2. For Chinese, is it better to use pinyin or phonemes? When I use phonemes, some tones are not accurate, like a non-native speaker speaking Chinese. My model using pinyin has not yet converged.
3. Why is there a big difference between training and testing? I set the same parameters for the synthesis function. The results for the test text during training (train.py) are much better than during testing (Benchmark.ipynb). The training-time samples are in samples/without_phonemes(use pinyin)/29037steps and samples/without_phonemes(use pinyin)/30886steps. The testing-time samples are in samples/without_phonemes(use pinyin)/30000steps.
4. Is there a big difference between training WaveRNN with raw wav files versus the TTS2 model? Which is better? Is there a guide to training the WaveRNN model?

The format of the file name is {text}-{speaker id}-{train steps}. Thank you very much! :)

puppyapple commented 4 years ago

@chynphh Since I'm also new to the TTS domain, I can only try to answer your questions from my own point of view, which may not be correct.

1. Using a vocoder will certainly give better audio quality. In this repo, erogol has already implemented GL to generate test audio for TensorBoard display. Have you listened to the result? I've tried both WaveRNN and ParallelWaveGAN. WaveRNN can reach high quality, but only with large 'overlap' parameters, which increase inference time. The ParallelWaveGAN result is a little noisy, though not very noticeably, and it is much faster.
2. In my own test, pinyin is sufficient to get good pronunciation.
3. 30k steps seems far from enough. You could wait longer.
4. I have not tried ground-truth mels from raw wav files for WaveRNN. Tacotron 2 generated mels seem to work well. You can try to understand erogol's implementation and give it a try. For me it's clear enough.
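
On the pinyin point, a practical detail when training in character mode is that every character in the transcripts, including the tone digits, has to be present in the model's character set. The sketch below checks a transcript against a character set. It assumes tone-numbered pinyin such as "ni3 hao3", and the `CHARACTERS` string and the function name `uncovered_chars` are hypothetical, not the repo's configuration.

```python
import string

# Hypothetical character set: lowercase letters, tone digits 1-5,
# space and basic punctuation.
CHARACTERS = string.ascii_lowercase + "12345" + " ,.?!"


def uncovered_chars(transcript, characters=CHARACTERS):
    """Return the sorted characters that the symbol table cannot encode."""
    return sorted(set(transcript) - set(characters))


ok = uncovered_chars("ni3 hao3")
missing = uncovered_chars("nü3 ren2")
```

A non-empty result, such as the "ü" in the second call, points at symbols that would need to be added to the character set or rewritten in the transcripts (for example as "v") before training.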

chynphh commented 4 years ago

@puppyapple thanks for your reply!

After my experiments, using pinyin is indeed better than phonemes. I trained Tacotron 2 for 240K steps. The results were good but still a bit noisy. Now I'm trying to train a WaveRNN model. I tried to use the mels generated by Tacotron 2, but they do not work with the raw wav files. It seems to be caused by a mismatch between the raw wav files and the mels generated by Tacotron 2 (https://github.com/mozilla/TTS/issues/26#issuecomment-574073157). So I trained WaveRNN with raw wav files and ground-truth mels. So far it hasn't worked after 180K steps. When training WaveRNN with mels from Tacotron2, which wavs do you use: the ground truth wav files or the wavs generated by Tacotron2?

puppyapple commented 4 years ago

@chynphh Mels generated by the trained Tacotron2 model as input, and ground truth audio files as target. Have you extracted the mels using the right config? You could refer to the benchmark notebook in this repo to do that. A few modifications may be needed. For https://github.com/mozilla/TTS/issues/26#issuecomment-574073157, maybe try to locate the out-of-range sample to find the reason (such as a 'hop_length' mismatch).
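
A simple sanity check for the kind of mismatch mentioned above is to compare the number of mel frames with what the wav length implies for a given hop_length. The sketch below assumes a centered STFT, which produces one frame per hop plus one. The function names and the tolerance are hypothetical, and a small difference can be expected when the decoder pads its output to a multiple of r.

```python
def expected_frames(n_samples, hop_length):
    """Frame count of a centered STFT: one frame per hop, plus one."""
    return n_samples // hop_length + 1


def check_alignment(n_samples, n_mel_frames, hop_length, tolerance=2):
    """Return (frame difference, whether it is within tolerance)."""
    diff = n_mel_frames - expected_frames(n_samples, hop_length)
    return diff, abs(diff) <= tolerance


# One second of 48 kHz audio, mels extracted with hop_length=600.
good = check_alignment(48000, 81, hop_length=600)
# Same audio, but the mels were extracted with hop_length=256.
bad = check_alignment(48000, 188, hop_length=600)
```

A large difference like the one in the second call would make the vocoder index past the end of the wav, which is consistent with an out-of-range error.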

chynphh commented 4 years ago

@puppyapple Thanks for your suggestions and answers, I will double check my code.

mozilla / TTS

English synthesis is good, how about Chinese? #58