vaibhavthapliyal opened this issue 5 years ago
The audio saved during training sounds okay because it's using teacher forcing -- sending the previous ground truth output as input to the next step in the decoder. From your attention plot, it looks like the model has not learned a good alignment, so you shouldn't expect to get good output. You most likely need more training data than you're using. 700 clips is probably not enough.
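To make the teacher-forcing point concrete: during training the decoder receives the ground-truth previous frame at each step, while at synthesis time it must feed back its own prediction, so errors compound if alignment is weak. Here is a minimal toy sketch of the two modes (the `decoder_step` function is a hypothetical stand-in, not the actual Tacotron code):

```python
def decoder_step(prev_frame, state):
    # Hypothetical stand-in for one Tacotron decoder step:
    # returns the next predicted frame and the updated state.
    new_state = 0.9 * state + 0.1 * prev_frame
    return new_state * 0.5, new_state

def run_decoder(ground_truth, teacher_forcing=True):
    """Generate one output frame per target frame.

    teacher_forcing=True  -> feed the ground-truth previous frame (training);
    teacher_forcing=False -> feed the model's own previous output, which is
    what synthesis (e.g. the demo server) actually does.
    """
    state, prev, outputs = 0.0, 0.0, []
    for t in range(len(ground_truth)):
        frame, state = decoder_step(prev, state)
        outputs.append(frame)
        prev = ground_truth[t] if teacher_forcing else frame
    return outputs

targets = [1.0, 0.8, 0.6, 0.4, 0.2]
teacher_forced = run_decoder(targets, teacher_forcing=True)   # training-time audio
free_running = run_decoder(targets, teacher_forcing=False)    # synthesis-time audio
```

With a poorly aligned model the two trajectories diverge quickly, which is why the audio saved during training can sound fine while the demo-server output does not.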
@keithito Thanks for replying. I am in the process of increasing my dataset to about 5000 clips. Should that be enough? I am currently working with the fork that requires less data for alignment. I'll post my progress here.
Thanks for your awesome work.
@vaibhavthapliyal You may try my fork if the quantity of your dataset is not enough https://github.com/keithito/tacotron/issues/198
@begeekmyfriend I am trying yours currently with the LJ Speech dataset and have successfully got alignment on it. Next I plan to do a POC for the same speaker in English using transfer learning, as indicated in some of the issues. Then I will use your fork to train for the Hindi language. Will keep you posted as well. Thanks!
Hi everyone, I increased my dataset and was able to produce a reasonable output in Hindi.
I increased the dataset to 3600 clips and ran one round of training, then increased it again to 6400 clips and ran another round. However, I have not observed much difference in output quality between the two trained models, although both produce a good alignment graph.
I am using the fork by @begeekmyfriend in the training process.
Here is an alignment curve from the training run with 3600 clips:
Here is an alignment curve from the training run with 6400 clips:
However, when synthesizing sentences from the checkpoints of both models, there is not much difference in terms of speech construction. The hparams were the same in both runs.
Is this behaviour normal? I was expecting better output from the larger dataset.
Thanks Vaibhav
@vaibhavthapliyal Did you use the master branch or the mandarin branch?
Hi,
I used the mandarin branch; HEAD is at commit b0461d8b23cecf3ac19c83563ee39614cfe09f74.
@vaibhavthapliyal The mandarin branch is only for Mandarin Chinese. I have now made master the default branch and merged all the updates into it. The master branch supports English. Please re-clone the repo and try again.
@begeekmyfriend I see in your latest commit that you have merged the mandarin branch into master. Should I revert that commit when using your code for my use case, since I am not working with Chinese?
The master branch is for English. The latest commit is an improvement; you may use it directly for your English corpus.
@begeekmyfriend I got this error while running the training on the master branch:
[2018-12-12 14:29:46.640] Saving checkpoint to: ./logs-tacotron/model.ckpt-2000
[2018-12-12 14:29:48.263] Saving audio and alignment...
[2018-12-12 14:29:57.653] Exiting due to exception: firwin() got an unexpected keyword argument 'fs'
The training stops after this error. Any pointers as to why this is happening?
pip install scipy -U
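For context on why the upgrade fixes this: to my knowledge the `fs` keyword was only added to `scipy.signal.firwin` in SciPy 1.2.0 (older releases expect `nyq` instead), so the training code fails on an older SciPy. A quick way to sanity-check your environment after upgrading:

```python
import scipy
import scipy.signal

print(scipy.__version__)  # should be >= 1.2.0 for the 'fs' keyword

# This is the call shape that raises
# "firwin() got an unexpected keyword argument 'fs'" on SciPy < 1.2.
# (numtaps/cutoff/fs values here are illustrative, not the repo's hparams.)
taps = scipy.signal.firwin(numtaps=101, cutoff=7600, fs=16000)
print(len(taps))  # number of filter coefficients
```

On an older SciPy the equivalent call would pass `nyq=sample_rate / 2` instead of `fs=sample_rate`.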
Hey everyone,
I am working with the Hindi language. My dataset is pretty small right now (about 700 clips) and the speaker is male. I wanted to check whether a reasonable output can be produced with this before I increase the dataset to about 6000-7000 clips.
However, I am facing an issue: the audio generated from the 65000-step checkpoint is pretty good, but when I run the same text through the demo server UI I do not get a similar output.
Here are the audio files generated for the input: ki bhaart kii senaa men sshstr blon men puruss shkti hii nhiin shrii shkti kaa bhii
Output from the checkpoint: step-66000-audio.zip
Output from demo server: 77e719af-3b21-4063-b3f4-4401993c6fc9.zip
Here's my alignment graph:
Any help in this would be appreciated.
Thanks