mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Mozilla Public License 2.0

New Tacotron2 model release with WaveRNN vocoder. #153

Closed erogol closed 5 years ago

erogol commented 5 years ago

A new TTS Tacotron2 model trained on LJSpeech is released. It should work well with the MOLD WaveRNN model.

You can also read more here #26

ZohaibAhmed commented 5 years ago

@erogol - tried to load the checkpoint with the latest code on the dev-tacotron2 branch. I get the following error:

RuntimeError: Error(s) in loading state_dict for Tacotron2:
    Missing key(s) in state_dict: "decoder.attention_layer.ta.weight", "decoder.attention_layer.ta.bias".

Solved - just make sure you use the right config.json files :)
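(For context: the error comes from PyTorch's strict `state_dict` check. A config.json from a different branch builds a model with extra layers — here the transition-agent linear layer `decoder.attention_layer.ta` — whose weights the released checkpoint doesn't contain. A minimal sketch of the mismatch using plain key sets instead of a real model; all key names other than the `ta` ones are illustrative:)

```python
# Keys stored in the released checkpoint (illustrative subset).
checkpoint_keys = {
    "decoder.attention_layer.v.weight",
    "encoder.embedding.weight",
}

# Keys the freshly built model expects; the wrong config enables an extra
# "transition agent" layer that the checkpoint never had.
model_keys = checkpoint_keys | {
    "decoder.attention_layer.ta.weight",
    "decoder.attention_layer.ta.bias",
}

# This difference is essentially what load_state_dict(strict=True)
# reports as "Missing key(s) in state_dict".
missing = sorted(model_keys - checkpoint_keys)
print(missing)
```

Using the config.json shipped alongside the checkpoint makes the two key sets identical, so the strict load succeeds.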

ZohaibAhmed commented 5 years ago

@erogol - I tried to train a new WaveRNN model (both from scratch and fine-tuned on top of yours), and also tried my previous implementation of WaveRNN. In each case the output is very scrambled:

https://drive.google.com/open?id=1iHo-b3WwGrvRUc-RjhpQA_G0GgycsENW

When I point the vocoder to the MOLD model that you published, I get clearer speech (I can make out all of the words), but with noise. Any ideas?

erogol commented 5 years ago

You need to train more to get cleaner output. But LJSpeech itself is also noisy, so a certain level of noise is to be expected.

ZohaibAhmed commented 5 years ago

@erogol - thanks. Is this the case even when I'm fine-tuning? By training more, do you mean training Tacotron more, or WaveRNN? After how many steps should the output generally start to get better?

I checked the alignment of what tacotron produces and it seems like the alignment is there.

erogol commented 5 years ago

I meant training WaveRNN. If you train from scratch, it starts to sound good after about 300K iterations, but that depends on the dataset.

ZohaibAhmed commented 5 years ago

@erogol Thanks. From your experience, do you think it's possible to fine-tune WaveRNN like we can fine-tune Tacotron? My dataset is just a couple of hours, so it might not be enough to train from scratch.

I've also tried to use my own implementation of WaveRNN (very similar to yours) and after 900k steps, it works well with Rayhane's tacotron implementation but not yours.

erogol commented 5 years ago

Fine-tuning WaveRNN works, but I haven't tried fine-tuning with a small dataset.

ZohaibAhmed commented 5 years ago

@erogol - I tried to finetune to 731k steps, the output still sounds scrambled: https://drive.google.com/file/d/1niGB9-IvkjW-Q7MTrgTtwa96Sp8Bu6Ub/view?usp=sharing

Any tips on what I can do to debug or see what might be wrong?

mrgloom commented 5 years ago

How do I use the WaveRNN model? I downloaded mold_ljspeech_best_model from https://github.com/erogol/WaveRNN#released-models (https://drive.google.com/drive/folders/1wpPn3a0KQc6EYtKL0qOi4NqEmhML71Ve) and used the suggested notebook from db7f3d3 (https://github.com/mozilla/TTS/blob/db7f3d36e7768f9179d42a8f19b88c2c736d87eb/notebooks/Benchmark.ipynb), but in the config I can't see CONFIG.use_phonemes and CONFIG.embedding_size.

Update: I fixed it. Tacotron2 and WaveRNN are separate models, and each should be used from its specific commit.

mrgloom commented 5 years ago

I have tried Tacotron2 + WaveRNN and found that the quality is good, but WaveRNN is too slow on CPU: about 3 sec for Tacotron2 and about 30 sec for WaveRNN. So it's comparable with the WaveGlow model in terms of speed, but the WaveRNN model is smaller. Also, Tacotron2 processing speed depends on sentence length (i.e. shorter sentences are processed faster, ~1 sec), but WaveRNN stays slow even for short sentences (~25 sec). Why?
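(A rough back-of-the-envelope for why WaveRNN stays slow even on short sentences: it is autoregressive at the sample level, so the number of sequential RNN steps scales with audio duration times the sample rate, and the per-step overhead on CPU dominates. The numbers below are illustrative assumptions, not measurements:)

```python
# WaveRNN generates audio one sample at a time, so the sequential step
# count scales with the audio duration, not the sentence count.
sample_rate = 22050          # LJSpeech models are commonly trained at 22050 Hz
duration_s = 2.0             # even a short utterance lasts a couple of seconds
steps = int(sample_rate * duration_s)   # one RNN step per output sample
print(steps)                 # 44100 sequential steps

# At an assumed ~2000 steps/sec on CPU, that is already ~22 sec of
# synthesis time for 2 sec of audio.
print(steps // 2000)         # 22
```

Tacotron2, by contrast, decodes one mel frame (many samples' worth of audio) per step, which is why its runtime tracks sentence length much more gently.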

Model size:

Tacotron2:
    336 MB ljspeech-260k/checkpoint_260000.pth.tar
WaveRNN:
    49 MB mold_ljspeech_best_model/checkpoint_393000.pth.tar

CorentinJ commented 5 years ago

@erogol Do you have a model trained on the latest commit?

haqkiemdaim commented 5 years ago

@erogol - tried to load the checkpoint with the latest code on the dev-tacotron2 branch. I get the following error:

RuntimeError: Error(s) in loading state_dict for Tacotron2:
  Missing key(s) in state_dict: "decoder.attention_layer.ta.weight", "decoder.attention_layer.ta.bias".

Solved - just make sure you use the right config.json files :)

May I know which config.json file solved your issue? @ZohaibAhmed

erogol commented 5 years ago

@CorentinJ not yet but I'll be releasing new models soon.

RaulButuc commented 4 years ago

@erogol is there any way to just run the pre-trained model with custom inputs in an "easy" way? (I don't really understand most of the code just yet, as I'm still learning about ML.)

reuben commented 4 years ago

@RaulButuc check out the instructions here: https://github.com/mozilla/TTS/wiki/Released-Models#simple-packaging---self-contained-package-that-runs-an-http-api-for-a-pre-trained-tts-model
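(Once the packaged server is installed and running, it exposes an HTTP endpoint that returns synthesized speech as a WAV response. A hedged sketch of querying it; the default port 5002 and the /api/tts endpoint are assumptions about the packaged server and may differ in your build:)

```python
from urllib.parse import urlencode

# Assumed defaults for the packaged TTS server; adjust port/endpoint
# to match your installation.
base = "http://localhost:5002/api/tts"
url = base + "?" + urlencode({"text": "Hello from Mozilla TTS."})
print(url)

# With the server running, fetching the URL returns WAV bytes:
#   from urllib.request import urlopen
#   wav_bytes = urlopen(url).read()
```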

RaulButuc commented 4 years ago

@reuben I actually tried that yesterday, but unfortunately there is a conflict of PyTorch versions in the requirements (I had to manually download an older PyTorch .whl to be able to install the TTS-0.0.1 package, which then throws a dependency requirements error when I try to run it).

reuben commented 4 years ago

You should be able to create and use a fresh virtualenv to avoid any conflicts.
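(A minimal sketch of that workflow; the package file name below is illustrative — use the actual artifact linked from the wiki page:)

```shell
# Install the packaged TTS server in its own virtualenv so its pinned
# PyTorch version cannot conflict with a globally installed one.
python3 -m venv tts-env
. tts-env/bin/activate
pip install --upgrade pip
pip install ./TTS-0.0.1.tar.gz   # pulls the matching torch inside the env
```

Deactivating or deleting the `tts-env` directory afterwards leaves the system Python untouched.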
