Model Release: Tacotron2 with Forward Attention - LJSpeech

erogol commented 4 years ago

Model Link: https://drive.google.com/open?id=10ymOlWHutqTtfDYhIbHULn2IKDKP0O9m
Colab example: https://colab.research.google.com/drive/1cpofjnfKSpFhiREgExENIsum4MrqxyPR

This model was trained with Forward Attention enabled until ~400K iterations and then fine-tuned with a Batch Norm prenet until the end. It is the best model trained so far.

I observe once again that a BN-based prenet improves spectrogram quality considerably, but if you train with it from scratch, the model does not learn the attention.

You can also use this TTS model with the PWGAN or WaveRNN vocoders. PWGAN provides real-time voice synthesis; WaveRNN is slower but provides better quality.

https://github.com/erogol/ParallelWaveGAN https://github.com/erogol/WaveRNN

You can see the TensorBoard (TB) figures below:

m-toman commented 4 years ago

For testing the model, this worked for me:

git clone https://github.com/erogol/WaveRNN.git
git clone https://github.com/mozilla/TTS.git
cd TTS
git checkout dev
mkdir demo_models
cd demo_models
mkdir -p wavernn_models tts_models
wavernn_pretrained_model=wavernn_models/checkpoint_433000.pth.tar
gdown -O ${wavernn_pretrained_model} https://drive.google.com/uc?id=12GRFk5mcTDXqAdO5mR81E-DpTk8v2YS9
wavernn_pretrained_model_config=wavernn_models/config.json
gdown -O ${wavernn_pretrained_model_config} https://drive.google.com/uc?id=1kiAGjq83wM3POG736GoyWOOcqwXhBulv
tts_pretrained_model=tts_models/checkpoint_670000.pth.tar
gdown -O ${tts_pretrained_model} https://drive.google.com/uc?id=1_mbQDLHekiearftLaraJaPl-FuNgOzKV
tts_pretrained_model_config=tts_models/config.json
gdown -O ${tts_pretrained_model_config} https://drive.google.com/uc?id=19FQscticcxQIFH4MwnxQ950LyxcN8kli
cd ../..
mkdir TTS/demo_output
python -m TTS.synthesize --use_cuda true --vocoder_config_path TTS/demo_models/wavernn_models/config.json --vocoder_path TTS/demo_models/wavernn_models/checkpoint_433000.pth.tar "Evil is Evil. Lesser, greater, middling… Makes no difference. The degree is arbitary. The definition’s blurred. If I’m to choose between one evil and another… I’d rather not choose at all." TTS/demo_models/tts_models/config.json TTS/demo_models/tts_models/checkpoint_670000.pth.tar TTS/demo_output/

but requires #349
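Before running the synthesize command, it can help to confirm that the four downloads landed where the command expects them. Here is a minimal Python sketch using the paths from the commands above. The helper name `missing_demo_files` is made up for illustration and is not part of TTS, and treating an empty file as a failed download is an assumption.

```python
from pathlib import Path

# Relative paths taken from the commands above (relative to TTS/demo_models).
EXPECTED_FILES = [
    "wavernn_models/checkpoint_433000.pth.tar",
    "wavernn_models/config.json",
    "tts_models/checkpoint_670000.pth.tar",
    "tts_models/config.json",
]


def missing_demo_files(demo_models_dir):
    """Return the expected files that are absent or empty (hypothetical helper)."""
    root = Path(demo_models_dir)
    missing = []
    for rel in EXPECTED_FILES:
        path = root / rel
        # An empty file is treated here as a failed download (assumption).
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_demo_files("TTS/demo_models"):
        print(f"missing or empty: {rel}")
```

If nothing is printed, all four files are present and non-empty.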

Interesting result for that long Witcher quote: evil.zip. It seems the dot is mapped to a breathing sound ;)

erogol commented 4 years ago

You can now test this model with PWGAN using: https://github.com/mozilla/TTS/blob/dev/notebooks/Benchmark-PWGAN.ipynb

erogol commented 4 years ago

I added a colab example running this model with PWGAN vocoder https://colab.research.google.com/drive/1cpofjnfKSpFhiREgExENIsum4MrqxyPR

reuben commented 4 years ago

I released a new server package with this model embedded in it: https://github.com/mozilla/TTS/wiki/Released-Models#simple-packaging---self-contained-package-that-runs-an-http-api-for-a-pre-trained-tts-model

erogol commented 4 years ago

I also created an example Colab using MelGAN as a vocoder. It was trained by replacing the PWGAN generator with MelGAN's architecture. It performs a bit better and runs slightly faster.

https://colab.research.google.com/drive/1Zg9jR27Pr-ziVa0krjtdoy2dKv6whv7b

nmstoker commented 4 years ago

The quality with this latest Colab is amazing and it does well even with longer sentences :slightly_smiling_face:

Btw, there was a reference to the PWGAN model that needed to be switched to MelGAN, but otherwise this is so straightforward to use.

erogol commented 4 years ago

@nmstoker good to hear that :)

What do you mean by "reference to pwgan"? Do you mean the server release?

nmstoker commented 4 years ago

> What do you mean by "reference to pwgan"? Do you mean the server release?

Sorry, I wasn't clear last night. It's a tiny thing, but in the last but one cell of the MelGAN Colab above, there's this line

vocoder_model.load_state_dict(torch.load(PWGAN_MODEL, map_location="cpu")["model"]["generator"])

And PWGAN_MODEL isn't defined (it's a simple matter of updating it to MELGAN_MODEL)
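For anyone hitting the same NameError, here is a rough sketch of the corrected step. It assumes the checkpoint is a nested dict with the generator weights under `["model"]["generator"]`, as the line above implies. `generator_state` is a hypothetical helper, and the small dict stands in for what `torch.load(MELGAN_MODEL, map_location="cpu")` would return, so the sketch runs without the actual checkpoint.

```python
def generator_state(checkpoint):
    """Pull the generator weights out of a nested checkpoint dict.

    Hypothetical helper: assumes the layout checkpoint["model"]["generator"]
    implied by the Colab line quoted above.
    """
    try:
        return checkpoint["model"]["generator"]
    except KeyError as err:
        raise KeyError(f"checkpoint has no {err} entry") from err


# Stand-in for: checkpoint = torch.load(MELGAN_MODEL, map_location="cpu")
checkpoint = {"model": {"generator": {"conv.weight": [0.0, 1.0]}}}
state = generator_state(checkpoint)
print(sorted(state))

# In the Colab the next step would then be:
#   vocoder_model.load_state_dict(state)
```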

nmstoker commented 4 years ago

Also, I noticed that the checkout given in the Colab appears not to exist. In the cell with this:

%cd ../ParallelWaveGAN/
! git checkout 22018e6

it didn't seem to cause a problem, it just quietly failed with:

error: pathspec '22018e6' did not match any file(s) known to git.

Presumably, right now, nothing has changed to break it since then.

It looks like it's probably meant to be git checkout a22018e, given there's this hash from a commit around the right time: a22018e6e6be1f9381b003496cc285bdd5a4a284 and it's just offset by one character.
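A quick way to see the off-by-one: an abbreviated git hash has to be a leading prefix of the full hash, and the value in the Colab starts one character in. A small Python check, using only the two strings quoted above (the helper name is just for illustration):

```python
FULL_HASH = "a22018e6e6be1f9381b003496cc285bdd5a4a284"


def is_hash_prefix(short, full=FULL_HASH):
    """True if `short` is a leading prefix of `full`."""
    return full.startswith(short)


print(is_hash_prefix("22018e6"))  # value in the Colab -> False
print(is_hash_prefix("a22018e"))  # suggested fix -> True
print(FULL_HASH.find("22018e6"))  # offset inside the full hash -> 1
```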

nmstoker commented 4 years ago

I've got what may be a silly question (if so, sorry! :slightly_smiling_face: )

Comparing the training stats charts above with the values set in the config.json for the released model, I see that for the orange line the stats change as if they're undergoing gradual training (ie they move at 50k, 130k, 290k) and then you've switched to BN fine-tuning with the blue line at 400k.

That orange gradual training pattern is consistent with the "gradual_training" values in the config file released with the model, but I see the comment mentions that gradual training is only for Tacotron, and yet this is Tacotron 2. Perhaps the comment simply hasn't been updated? (it's like that in all the configs I've seen since it was introduced)

Q. Does gradual training work for both models now? Or am I missing something about how you set the config file up for the initial orange training run? (eg that was changed when you switched to BN fine-tuning)

Thanks!

Edit: actually it looks like maybe the comment has been removed from the config.json here: https://github.com/mozilla/TTS/blob/master/config.json#L41 so presumably it does now work for both

erogol commented 4 years ago

@nmstoker yeah it works for both now :)
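To illustrate how a step-based schedule like this produces the jumps seen in the charts, here is a small sketch. It assumes each "gradual_training" entry has the form [start_step, r, batch_size]. The step boundaries (50k, 130k, 290k) are the ones mentioned above, while the r and batch-size values are invented for the example and are not the released config. `schedule_at` is a hypothetical helper, not the TTS function.

```python
# Illustrative schedule: step boundaries from the discussion above,
# r and batch-size values made up for the example.
GRADUAL_TRAINING = [
    [0, 7, 64],
    [50000, 3, 32],
    [130000, 2, 32],
    [290000, 1, 32],
]


def schedule_at(global_step, schedule=GRADUAL_TRAINING):
    """Return (r, batch_size) of the last entry whose start step has been reached."""
    current = schedule[0]
    for entry in schedule:
        if global_step >= entry[0]:
            current = entry
    return current[1], current[2]


for step in (0, 49999, 50000, 200000, 400000):
    print(step, schedule_at(step))
```

Each time the step counter crosses a boundary, the values switch, which would show up as a visible shift in the training curves.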

george-roussos commented 4 years ago

The model produces great results, and I shall try to adapt it to a new speaker with an average-sized dataset (6 hrs), male voice, no silences, clean audio. Will report results.

I was able to test the model yesterday; however, I keep getting an "AttributeError: 'AttrDict' object has no attribute 'mulaw'" error, even though I have defined mulaw in both config files I use (should I define it as true, or something else?). I might be doing something wrong. Anybody care enough to chime in?

erogol commented 4 years ago

@george-roussos which model exactly are you training?

If something is missing in the config file, just add it. In the worst case, you can try what seems logical, but the mu-law setting belongs to the WaveRNN vocoder, which is not related to the TTS model.
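As a sketch of "just add it": the snippet below fills in keys a config lacks before the code reads them as attributes. `AttrDict` here is a minimal stand-in, not the class from the repo. It assumes the config text is plain JSON without comments, and `"mulaw": False` is only a placeholder default; the right value is whatever the vocoder was trained with.

```python
import json


class AttrDict(dict):
    """Minimal dict with attribute access (a stand-in, not the TTS class)."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(
                f"'AttrDict' object has no attribute '{name}'"
            ) from None


def load_config(text, defaults):
    """Parse JSON config text and fill in any keys it lacks from `defaults`."""
    config = AttrDict(defaults)
    config.update(json.loads(text))  # values in the file win over defaults
    return config


# "mulaw": False is a placeholder default, not a recommended setting.
cfg = load_config('{"sample_rate": 22050}', defaults={"mulaw": False})
print(cfg.mulaw, cfg.sample_rate)
```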

george-roussos commented 4 years ago

> @george-roussos which model exactly are you training?
>
> If something is missing in the config file, just add it. In the worst case, you can try what seems logical, but the mu-law setting belongs to the WaveRNN vocoder, which is not related to the TTS model.

I am not training anything right now, I am testing the model. The implementation I have is the TTS model trained on forward attention and batch normalization and the WaveRNN vocoder, which I am guessing is universal. My thought was I could first try and finetune the TTS model and see how it performs when adapted on a new voice when the data is clean and not sparse. Do you think it would be possible and, if so, what would your expectation be with a good quality dataset of 12 hours?

george-roussos commented 4 years ago

Back again. What branch/commit should we use to retrain the TTS model? I am trying to run distribute.py with the config that comes with checkpoint_670000.pth.tar, but is that the correct way to do it?

erogol commented 4 years ago

@george-roussos this is not the right place for this topic. You had better post it on discord.

Try the commit version given with the model (see the model table), and yes, it is the right way.

george-roussos commented 4 years ago

Is there any way we can make this model compatible with a universal WaveRNN vocoder after fine-tuning to a new voice? I tried to plug in the universal checkpoint from the git repo, but I get RuntimeError: Error(s) in loading state_dict for Model: size mismatch for upsample.up_layers.5.weight: copying a param with shape torch.Size([1, 1, 1, 17]) from checkpoint, the shape in current model is torch.Size([1, 1, 1, 23]). The model @m-toman links above only works with LJSpeech.
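To list every parameter that would trip this error before calling `load_state_dict`, the shapes on both sides can be compared first. Below is a minimal sketch that works on plain name-to-shape dicts; `shape_mismatches` is a hypothetical helper, and the example shapes are copied from the error message above.

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """List (name, checkpoint_shape, model_shape) for every differing parameter."""
    mismatches = []
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        # Parameters missing from the model are skipped; only shape clashes count.
        if model_shape is not None and tuple(ckpt_shape) != tuple(model_shape):
            mismatches.append((name, tuple(ckpt_shape), tuple(model_shape)))
    return mismatches


# Shapes copied from the error message above.
ckpt = {"upsample.up_layers.5.weight": (1, 1, 1, 17)}
model = {"upsample.up_layers.5.weight": (1, 1, 1, 23)}
for name, ckpt_shape, model_shape in shape_mismatches(ckpt, model):
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

With PyTorch, the two dicts could be built from the state dicts, for example `{k: tuple(v.shape) for k, v in state_dict.items()}`.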

By the way, the results I am getting after fine-tuning on a 5-hour dataset with transcription errors are pretty good...

erogol commented 4 years ago

The difference between the universal WaveRNN and the TTS model you are using is the sampling rate. The WaveRNN model uses 16 kHz and the TTS model uses 22050 Hz. So maybe you need to fine-tune WaveRNN with this rate too. Or you can reduce the sampling rate when you fine-tune TTS with your dataset.

You also need to check out the right version of WaveRNN given with the model checkpoint.
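A small guard for the sampling-rate point: compare the two configs before pairing a TTS checkpoint with a vocoder. This sketch assumes both config files keep the rate under `["audio"]["sample_rate"]`, which may not hold for every version; `check_sample_rates` is a hypothetical helper, and the rates are the ones mentioned above.

```python
def check_sample_rates(tts_config, vocoder_config):
    """Return the shared sample rate, or raise ValueError if the configs disagree.

    Assumes both configs store the rate at config["audio"]["sample_rate"].
    """
    tts_sr = tts_config["audio"]["sample_rate"]
    voc_sr = vocoder_config["audio"]["sample_rate"]
    if tts_sr != voc_sr:
        raise ValueError(
            f"sample rate mismatch: TTS uses {tts_sr} Hz, vocoder uses {voc_sr} Hz"
        )
    return tts_sr


tts_cfg = {"audio": {"sample_rate": 22050}}
universal_cfg = {"audio": {"sample_rate": 16000}}
try:
    check_sample_rates(tts_cfg, universal_cfg)
except ValueError as err:
    print(err)
```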

george-roussos commented 4 years ago

Thanks. Do I check out the commit given? I imagine fine-tuning to 22050 is not as simple as editing the rate in config.json and restoring the checkpoint?

Jackiexiao commented 4 years ago

> @george-roussos this is not the right place for this topic. You had better post it on discord.
>
> Try the commit version given with the model (see the model table), and yes, it is the right way.

Is there a discord server for TTS topics?

nmstoker commented 4 years ago

@Jackiexiao yes, please have a look at the main page of this repo https://github.com/mozilla/TTS and you'll see the link to the Discourse forum there

Model Release: Tacotron2 with Forward Attention - LJSpeech #345