Closed — erogol closed this issue 5 years ago
@erogol - I tried to load the checkpoint with the latest code on the dev-tacotron2
branch. I get the following error:
RuntimeError: Error(s) in loading state_dict for Tacotron2:
Missing key(s) in state_dict: "decoder.attention_layer.ta.weight", "decoder.attention_layer.ta.bias".
Solved - just make sure you use the right config.json files :)
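For anyone hitting the same error, a quick way to spot the mismatch before loading is to diff the key sets of the checkpoint and the freshly built model. The sketch below uses plain sets as stand-ins for the real `torch.load(path)["model"].keys()` and `model.state_dict().keys()`:

```python
# Hypothetical key sets standing in for the real checkpoint / model state dicts.
ckpt_keys = {
    "decoder.attention_layer.query_layer.weight",
    "decoder.attention_layer.query_layer.bias",
}
model_keys = ckpt_keys | {
    "decoder.attention_layer.ta.weight",  # layers the new code expects
    "decoder.attention_layer.ta.bias",    # but the old checkpoint lacks
}

missing = sorted(model_keys - ckpt_keys)     # in the model, absent from the checkpoint
unexpected = sorted(ckpt_keys - model_keys)  # in the checkpoint, unknown to the model
print("Missing:", missing)
print("Unexpected:", unexpected)
```

A non-empty `missing` list like the one in the traceback usually means the checkpoint and the code (or the config.json settings, e.g. forward attention on/off) come from different commits.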
@erogol - I tried to train a new WaveRNN model (from scratch and fine-tuned on top of yours), and also tried my previous implementation of WaveRNN. In each case the output is very scrambled:
https://drive.google.com/open?id=1iHo-b3WwGrvRUc-RjhpQA_G0GgycsENW
When I point the vocoder to the MOLD model that you published, I get clearer speech (I can make out all of the words) but with noise. Any ideas?
You need to train more to get cleaner output, but LJSpeech itself is noisy, so some noise is acceptable.
@erogol - thanks. Is this the case even when I'm fine-tuning? By training more, do you mean training tacotron more or WaveRNN? How many steps should it generally start to get better?
I checked the alignment of what tacotron produces and it seems like the alignment is there.
I meant training WaveRNN. If you train from scratch, it sounds good after 300K iterations, but that depends on the dataset.
@erogol Thanks. From your experience, do you think it's possible to fine-tune WaveRNN like we can fine-tune tacotron? My dataset is just a couple of hours so it might not be enough to train from scratch.
I've also tried to use my own implementation of WaveRNN (very similar to yours) and after 900k steps, it works well with Rayhane's tacotron implementation but not yours.
Fine-tuning WaveRNN works, but I haven't tried fine-tuning on a small dataset.
@erogol - I tried to finetune to 731k steps, the output still sounds scrambled: https://drive.google.com/file/d/1niGB9-IvkjW-Q7MTrgTtwa96Sp8Bu6Ub/view?usp=sharing
Any tips on what I can do to debug or see what might be wrong?
How do I use the WaveRNN model?
I downloaded mold_ljspeech_best_model
from here https://github.com/erogol/WaveRNN#released-models (https://drive.google.com/drive/folders/1wpPn3a0KQc6EYtKL0qOi4NqEmhML71Ve)
and used the suggested notebook from db7f3d3 https://github.com/mozilla/TTS/blob/db7f3d36e7768f9179d42a8f19b88c2c736d87eb/notebooks/Benchmark.ipynb
But in the config I can't see CONFIG.use_phonemes
or CONFIG.embedding_size.
Update: I fixed it. Tacotron2 and WaveRNN are separate models, and each should be used from its specific commit.
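If you'd rather not pin commits, a defensive fallback for config fields that an older config.json doesn't have looks like this (a sketch; the default values here are illustrative, not the project's official ones):

```python
from types import SimpleNamespace

# Stand-in for a config loaded from an older commit's config.json,
# which predates the use_phonemes / embedding_size fields.
CONFIG = SimpleNamespace(num_mels=80, sample_rate=22050)

# Fall back to explicit defaults instead of crashing with AttributeError.
# (These defaults are illustrative; check the matching commit's config.json.)
use_phonemes = getattr(CONFIG, "use_phonemes", False)
embedding_size = getattr(CONFIG, "embedding_size", 256)
print(use_phonemes, embedding_size)
```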
I have tried Tacotron2 + WaveRNN and found the quality is good, but WaveRNN is too slow on CPU: about 3 s for Tacotron2 and about 30 s for WaveRNN. So it's comparable with the WaveGlow model in terms of speed, although the WaveRNN model size is smaller. Also, Tacotron2 processing time depends on sentence length (shorter sentences are processed faster, ~1 s), but WaveRNN is slow even for short sentences, ~25 s. Why?
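On the timing question: WaveRNN is autoregressive over individual audio samples, so even a short sentence requires tens of thousands of sequential steps, which dominates CPU latency. A minimal way to time the two stages separately — the functions below are dummy stand-ins, not the repo's actual API:

```python
import time

def run_tacotron2(text):
    # Stand-in for spectrogram prediction; cost scales with text length.
    time.sleep(0.01 * len(text.split()))
    return [0.0] * 80  # fake mel frames

def run_wavernn(mel):
    # Stand-in for waveform generation. The real WaveRNN generates one
    # audio sample per step (22050 steps per second of audio), which is
    # why even short utterances are slow on CPU.
    time.sleep(0.05)
    return [0.0] * 22050  # fake one-second waveform

text = "A short test sentence."
t0 = time.perf_counter()
mel = run_tacotron2(text)
t1 = time.perf_counter()
wav = run_wavernn(mel)
t2 = time.perf_counter()
print(f"tacotron2: {t1 - t0:.3f}s  wavernn: {t2 - t1:.3f}s")
```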
Model size:
Tacotron2:
336Mb ljspeech-260k/checkpoint_260000.pth.tar
WaveRnn:
49Mb mold_ljspeech_best_model/checkpoint_393000.pth.tar
@erogol Do you have a model trained on the latest commit?
May I know which config.json file solved your issue? @ZohaibAhmed
@CorentinJ not yet but I'll be releasing new models soon.
@erogol is there any way to just run the pre-trained model on custom inputs in an "easy" way? (I don't really understand most of the code yet, as I'm still learning about ML.)
@RaulButuc check out the instructions here: https://github.com/mozilla/TTS/wiki/Released-Models#simple-packaging---self-contained-package-that-runs-an-http-api-for-a-pre-trained-tts-model
@reuben I actually tried that yesterday, but unfortunately there is a conflict of PyTorch versions in the requirements (I had to manually download an older PyTorch .whl to install the TTS-0.0.1 package, which then throws a dependency-requirements error when I try to run it).
EDIT:
You should be able to create and use a fresh virtualenv to avoid any conflicts.
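For reference, a minimal sketch of that (the package name and version are taken from the thread; the env name is arbitrary):

```shell
# Create an isolated environment so the package's pinned torch version
# doesn't clash with whatever is installed globally.
python3 -m venv tts-env
. tts-env/bin/activate
python -c "import sys; print(sys.prefix)"   # confirms the env's interpreter is active
# Inside the clean env you would then install the package:
# pip install TTS==0.0.1
```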
A new TTS Tacotron2 model trained on LJSpeech is released. It should work well with the MOLD WaveRNN model.
The model has been trained for 260K iterations and has the best validation loss so far on LJSpeech.
The model was first trained with the dropout prenet, as in the original paper, and then switched to the BN prenet described above. Finally, it was trained with "forward attention", just for experimental reasons.
At inference time you can try different attention-related parameters and pick whichever fits you best: switch forward attention on/off, use "sigmoid" or "softmax" norm, or consider attention windowing. The default settings are given in the model's config.json.
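As a sketch of trying those switches — the attribute names below are assumptions based on config files from around that time, so check your checkpoint's config.json for the exact keys:

```python
from types import SimpleNamespace

# Stand-in for the CONFIG object loaded from the released model's config.json.
# Attribute names are assumptions, not verified against a specific commit.
CONFIG = SimpleNamespace(use_forward_attn=True,
                         attention_norm="sigmoid",
                         windowing=False)

# Try a different combination at inference time: forward attention off,
# softmax normalization, attention windowing on.
CONFIG.use_forward_attn = False
CONFIG.attention_norm = "softmax"
CONFIG.windowing = True
print(CONFIG.use_forward_attn, CONFIG.attention_norm, CONFIG.windowing)
```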
I think both the WaveRNN and TTS models have more room for fine-tuning (especially WaveRNN) to get better results.
You can also read more here #26