syang1993 / gst-tacotron

A TensorFlow implementation of "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis"

Mumbling in synthesis #45

Open a-froghyar opened 3 years ago

a-froghyar commented 3 years ago

Hey, thanks for the implementation @syang1993!

I'm using this code to implement another paper and I've run into some issues during synthesis. Alignment during training is good and the interim synthesised results sound good. During evaluation, however, synthesis is very unpredictable and sometimes fails to produce intelligible speech; it sounds more like mumbling. This happens not only on long utterances but sometimes on short and mid-length texts too. I'm attaching a few alignment plots and audio examples.

I was wondering whether you've come across this before and whether you have any tips on where I should look to fix it. I trained the model with multi-head attention; do you reckon GMM attention would improve things a lot?

Attachments: eval-320000_ref-frankenstein_chp_13-4-align, eval-320000_ref-frankenstein_chp_13-3-2-align, mumbling_samples.zip

EFHIII commented 3 years ago

This paper attempts to address this issue:
https://arxiv.org/abs/1910.10288
https://google.github.io/tacotron/publications/location_relative_attention/index.html

There appears to be a PyTorch implementation: https://github.com/bshall/Tacotron
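
For context on the GMM attention question in this thread, here is a minimal, framework-free sketch of one decoder step of a GMM-style (location-relative) attention. It is not the code from this repository or from the linked implementations, and it only loosely follows the general form of such mechanisms. The function name `gmm_attention_step` and its `raw_*` inputs are hypothetical; in a real model they would be predicted from the decoder state by a small network. The point it illustrates is that each component mean can only move forward, so the alignment cannot jump backwards or get stuck behind an already-read position.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)); the result is always positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softmax(xs):
    # Standard softmax with max-subtraction for stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gmm_attention_step(prev_means, raw_weights, raw_deltas, raw_scales,
                       num_encoder_steps):
    """One decoder step of a GMM-style attention (illustrative sketch).

    prev_means: component means from the previous decoder step.
    raw_weights, raw_deltas, raw_scales: unnormalised per-component values
        (hypothetical inputs; a real model predicts them from decoder state).
    Returns (alignment over encoder positions, updated means).
    """
    weights = softmax(raw_weights)               # mixture weights sum to 1
    deltas = [softplus(d) for d in raw_deltas]   # strictly positive step
    scales = [softplus(s) for s in raw_scales]   # strictly positive std dev
    # Means only ever increase, which makes the mechanism monotonic.
    means = [m + d for m, d in zip(prev_means, deltas)]

    alignment = []
    for j in range(num_encoder_steps):
        a = 0.0
        for w, mu, sigma in zip(weights, means, scales):
            norm = sigma * math.sqrt(2.0 * math.pi)
            a += w * math.exp(-((j - mu) ** 2) / (2.0 * sigma ** 2)) / norm
        alignment.append(a)
    # Note: values are Gaussian densities sampled at integer positions,
    # so they are non-negative but not guaranteed to sum exactly to 1.
    return alignment, means


if __name__ == "__main__":
    means = [0.0, 0.0]
    for step in range(5):
        alignment, means = gmm_attention_step(
            means, [0.1, -0.2], [0.5, -1.0], [0.0, 0.3], num_encoder_steps=20)
        peak = max(range(len(alignment)), key=alignment.__getitem__)
        print(step, peak, [round(m, 3) for m in means])
```

Because the alignment depends only on position and not on encoder content, a mechanism of this shape trades some flexibility for robustness at inference time. Whether it fixes the mumbling described above would still need to be checked by retraining.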