vaibhavthapliyal opened this issue 5 years ago
The audio saved during training sounds okay because it's using teacher forcing -- sending the previous ground truth output as input to the next step in the decoder. From your attention plot, it looks like the model has not learned a good alignment, so you shouldn't expect to get good output. You most likely need more training data than you're using. 700 clips is probably not enough.
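To make the teacher-forcing point concrete: during training the decoder receives the ground-truth previous frame at each step, while at synthesis time it must feed back its own prediction, so errors compound if alignment is weak. Here is a minimal toy sketch of the two modes (the `decoder_step` function is a hypothetical stand-in, not the actual Tacotron code):

```python
def decoder_step(prev_frame, state):
    # Hypothetical stand-in for one Tacotron decoder step:
    # returns the next predicted frame and the updated state.
    new_state = 0.9 * state + 0.1 * prev_frame
    return new_state * 0.5, new_state

def run_decoder(ground_truth, teacher_forcing=True):
    """Generate one output frame per target frame.

    teacher_forcing=True  -> feed the ground-truth previous frame (training);
    teacher_forcing=False -> feed the model's own previous output, which is
    what synthesis (e.g. the demo server) actually does.
    """
    state, prev, outputs = 0.0, 0.0, []
    for t in range(len(ground_truth)):
        frame, state = decoder_step(prev, state)
        outputs.append(frame)
        prev = ground_truth[t] if teacher_forcing else frame
    return outputs

targets = [1.0, 0.8, 0.6, 0.4, 0.2]
teacher_forced = run_decoder(targets, teacher_forcing=True)   # training-time audio
free_running = run_decoder(targets, teacher_forcing=False)    # synthesis-time audio
```

With a poorly aligned model the two trajectories diverge quickly, which is why the audio saved during training can sound fine while the demo-server output does not.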
@keithito Thanks for replying. I am in the process of increasing my dataset to about 5000 clips. Should that be enough? I am currently working with the fork that requires less data for alignment. I'll post my progress here.
Thanks for your awesome work.
@vaibhavthapliyal You may try my fork if the quantity of your dataset is not enough https://github.com/keithito/tacotron/issues/198
@begeekmyfriend I am trying yours currently with the LJ Speech dataset and have successfully got alignment on it. Next I plan to do a POC for the same speaker in English using transfer learning, as indicated in some of the issues. Then I will use your fork to train for the Hindi language. Will keep you posted as well. Thanks!
Hi everyone, I increased my dataset and was able to produce a reasonable output in Hindi.
I increased the dataset to 3600 clips and ran one round of training, then increased it again to 6400 clips and ran another round. However, I have not observed much difference in output quality between the two trained models, although both produce a good alignment graph.
I am using the fork by @begeekmyfriend in the training process.
Here is an alignment curve from the training run with 3600 clips:
Here is an alignment curve from the training run with 6400 clips:
However, when synthesizing sentences from the checkpoints of both models, there is not much difference in terms of speech construction. The hparams were the same in both runs.
Is this behaviour normal? I was expecting better output from the larger dataset.
Thanks Vaibhav
@vaibhavthapliyal Did you use the master branch or the mandarin branch?
Hi,
I used the mandarin branch; HEAD is at commit b0461d8b23cecf3ac19c83563ee39614cfe09f74.
@vaibhavthapliyal The mandarin branch is only for Mandarin Chinese. I have now made master the default branch and merged all the updates into it. The master branch supports English. Please re-clone the repo and try again.
@begeekmyfriend I see in your latest commit that you have merged the mandarin branch into master. Should I revert that commit when using your code for my use case, since I am not working with Chinese?
The master branch is for English. The latest commit is an improvement; you may use it directly for your English corpus.
@begeekmyfriend I got this error while running the training on the master branch:
[2018-12-12 14:29:46.640] Saving checkpoint to: ./logs-tacotron/model.ckpt-2000
[2018-12-12 14:29:48.263] Saving audio and alignment...
[2018-12-12 14:29:57.653] Exiting due to exception: firwin() got an unexpected keyword argument 'fs'
The training stops after this error. Any pointers as to why this is happening?
pip install scipy -U
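For context on why the upgrade fixes this: to my knowledge the `fs` keyword was only added to `scipy.signal.firwin` in SciPy 1.2.0 (older releases expect `nyq` instead), so the training code fails on an older SciPy. A quick way to sanity-check your environment after upgrading:

```python
import scipy
import scipy.signal

print(scipy.__version__)  # should be >= 1.2.0 for the 'fs' keyword

# This is the call shape that raises
# "firwin() got an unexpected keyword argument 'fs'" on SciPy < 1.2.
# (numtaps/cutoff/fs values here are illustrative, not the repo's hparams.)
taps = scipy.signal.firwin(numtaps=101, cutoff=7600, fs=16000)
print(len(taps))  # number of filter coefficients
```

On an older SciPy the equivalent call would pass `nyq=sample_rate / 2` instead of `fs=sample_rate`.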
Hey everyone,
I am working with the Hindi language. My dataset is pretty small right now (about 700 clips) and the speaker is male. I wanted to check whether a reasonable output can be produced with this before I increase the dataset to about 6000-7000 clips.
However, I am facing an issue: the audio generated from the 65000-step checkpoint is pretty good, but when I run the same text through the demo server UI I do not get a similar output.
Here are the audio files generated for the input: ki bhaart kii senaa men sshstr blon men puruss shkti hii nhiin shrii shkti kaa bhii
Output from the checkpoint: step-66000-audio.zip
Output from demo server: 77e719af-3b21-4063-b3f4-4401993c6fc9.zip
Here's my alignment graph:
Any help in this would be appreciated.
Thanks