keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

Model produces a faster rate of speech for longer sentences #186

Closed LearnedVector closed 6 years ago

LearnedVector commented 6 years ago

Hello,

Below are a few examples of my output as well as a few issues I've run into. Hopefully, this post can also help others.

So I've trained Tacotron using the LocationSensitiveAttention found in the tacotron2 branch, on a custom dataset. The issue is that during evaluation the model seems to decode speech in a maximum of about 50 decoder time steps, even when the utterance requires more. This causes the speech rate to speed up during evaluation inference.

All hyperparameters are the default values found here, except max_iters, which is set to 300 during training instead of 200, and to 320 during eval so I can evaluate longer utterances (using 300 during eval makes no difference for this issue).
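For anyone newer to the repo, here's a minimal sketch of what that override looks like (this assumes TF 1.x and the repo's tf.contrib.training.HParams setup; only the field relevant to this issue is shown):

```python
from tensorflow.contrib.training import HParams

# Minimal sketch: the real hparams.py defines many more fields.
hparams = HParams(
    max_iters=300,  # hard cap on decoder output steps (the repo's default is 200)
)

# train.py / eval.py accept a --hparams flag whose value is handed to
# parse(), so the cap can also be raised per run without editing the file:
hparams.parse('max_iters=320')
print(hparams.max_iters)  # 320
```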

Below you can see the evaluation alignments of two sentences. One is much longer than the other, but they still stop decoding at about the same time step.

eval-171000-0 wav file for the above alignment. Note that the long horizontal line is just silence, which got trimmed out.

eval-171000-1 wav file for the above alignment. Note that the long horizontal line is just silence, which got trimmed out.

Below is an example of the alignment and voice samples for training data. You can see that the speech rate is normal and the model can decode past 60 time steps. step-130000-align wav file for the above alignment.

The model decodes anything that fits in shorter time steps fine. eval-171000-7 wav file for the above alignment. Note that the long horizontal line is just silence, which got trimmed out.

Also, for reference, here is the distribution of my dataset. The x-axis is the character length per sample and the y-axis is the frequency of samples with that character length.
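(For anyone who wants to reproduce a plot like this, here's a sketch, assuming an LJSpeech-style metadata.csv where the transcript is the last |-separated field:)

```python
import matplotlib.pyplot as plt

# Character length of each transcript in the dataset.
with open('metadata.csv', encoding='utf-8') as f:
    lengths = [len(line.strip().split('|')[-1]) for line in f]

plt.hist(lengths, bins=50)
plt.xlabel('characters per sample')
plt.ylabel('number of samples')
plt.show()
```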

Any ideas on why the model seems to try to fit speech into 50 decoding time steps?

ryancwalsh commented 6 years ago

I'm super new to Python, TensorFlow, Tacotron, and neural networks in general. I'm finding it REALLY difficult to get started.

I was very excited to follow the instructions at https://github.com/keithito/tacotron and (after hours of debugging) get the browser demo to work.

However, like I think you're saying, it seems to handle only 1 or 2 short sentences at a time.

If I try to synthesize this text, it totally freaks out and sounds weird:

Little Bo-Peep has lost her sheep, And can't tell where to find them; Leave them alone, and they'll come home, And bring their tails behind them.

Little Bo-Peep fell fast asleep, And dreamt she heard them bleating; But when she awoke, she found it a joke, For still they all were fleeting.

Then up she took her little crook, Determined for to find them; She found them indeed, but it made her heart bleed, For they'd left all their tails behind 'em!

It happened one day, as Bo-peep did stray Unto a meadow hard by-- There she espied their tails, side by side, All hung on a tree to dry.

She heaved a sigh and wiped her eye, And over the hillocks she raced; And tried what she could, as a shepherdess should, That each tail should be properly placed.

I wonder how we can fix this (and if someone already has)? I will definitely try to find a workaround because I want to synthesize ~1 page of text at a time.

@LearnedVector I would LOVE any advice you have, also, on how I can use my own audio recordings to retrain this system to speak in my own voice. If I have a long YouTube clip of me speaking, and the clip has SRT or VTT subtitles/captions, and I export the video and the captions, can I use that? If not, what is the process for providing audio and transcriptions to feed into the system? Thanks so much.

LearnedVector commented 6 years ago

@ryancwalsh sorry for the late reply. I haven't tried it, but if you want to generate longer sentences you need to increase the max_iters hyperparameter in your demo server. I believe it's currently capped at about 12.5 seconds of audio and will choke on any sentence longer than that. Also, you can try the YouTube videos; I'm not sure how well that will turn out. For me, I had a colleague record 16 hours of clear, high-quality audio samples from a corpus we came up with.
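To make the 12.5-second figure concrete, here's the back-of-envelope arithmetic using what I believe are the repo's defaults (worth double-checking against hparams.py):

```python
# Max audio length implied by the decoder-step cap:
max_iters = 200          # default decoder step cap in hparams.py
outputs_per_step = 5     # mel frames emitted per decoder step
frame_shift_ms = 12.5    # spectrogram hop size in milliseconds

max_seconds = max_iters * outputs_per_step * frame_shift_ms / 1000.0
print(max_seconds)  # -> 12.5
```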

ryancwalsh commented 6 years ago

@LearnedVector Thanks for your response. I've gotten around the challenge of the short clips by splitting my page of text into sentences, processing each one separately, and then concatenating the wav files afterwards.
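Here's a rough sketch of that workaround, in case it helps anyone (synthesize() is a hypothetical stand-in for whatever invokes the model, e.g. eval.py or the demo server; it assumes each call returns 16-bit mono audio at one common sample rate):

```python
import re
import numpy as np
from scipy.io import wavfile

def synthesize(sentence):
    """Hypothetical placeholder: return (sample_rate, int16 samples) for one sentence."""
    raise NotImplementedError

text = open('page.txt').read()
# Naive sentence split on ., !, or ? followed by whitespace.
sentences = re.split(r'(?<=[.!?])\s+', text.strip())

clips = [synthesize(s) for s in sentences]
rate = clips[0][0]
audio = np.concatenate([samples for _, samples in clips])
wavfile.write('page.wav', rate, audio)
```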

However, I've been really struggling otherwise. I have 99 minutes of audio clips (max 10 seconds each) with transcriptions. I'm trying to start with Linda Johnson (the LJ Speech voice) as a basis and then get the model to ultimately sound like the voice of my recordings.

I've tried all of these:

But none of my results are intelligible enough.

And https://github.com/Kyubyong/speaker_adapted_tts claims that only 1 minute of sample recordings is required.

Baidu claims that only 3 seconds of samples are required!

My main question is: what is the best documented approach to creating a TTS in a certain voice?

And also, as I'm completely new to machine learning, what do your graphs above mean?

And when I read my graphs, how am I supposed to adjust something based on what I see?

Where can I learn more about how to train a system to adapt a voice from Linda Johnson to another voice?

Thank you so much :-)

LearnedVector commented 6 years ago

@ryancwalsh those are different techniques and don't apply to this Tacotron repo specifically.

LearnedVector commented 6 years ago

Found the source of the issue: it was due to my dataset being heavily unbalanced in character lengths. You can see more details here. Closing.
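For illustration only, since the comment above doesn't spell out the actual fix: one simple way to rebalance is to cap how many samples any one character-length bucket contributes (bucket width and cap below are made-up values, and the metadata.csv format is the LJSpeech-style |-separated one):

```python
from collections import defaultdict

BUCKET_WIDTH = 20    # characters per length bucket (hypothetical value)
MAX_PER_BUCKET = 400 # cap on samples per bucket (hypothetical value)

buckets = defaultdict(list)
with open('metadata.csv', encoding='utf-8') as f:
    for line in f:
        text = line.strip().split('|')[-1]
        buckets[len(text) // BUCKET_WIDTH].append(line)

with open('metadata_balanced.csv', 'w', encoding='utf-8') as out:
    for _, lines in sorted(buckets.items()):
        out.writelines(lines[:MAX_PER_BUCKET])
```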