keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)

reconstruct the zero-padded frames #173

Open yyt233 opened 6 years ago

yyt233 commented 6 years ago

As the paper says: "We train using a batch size of 32, where all sequences are padded to a max length. It's a common practice to train sequence models with a loss mask, which masks loss on zero-padded frames. However, we found that models trained this way don't know when to stop emitting outputs, causing repeated sounds towards the end. One simple trick to get around this problem is to also reconstruct the zero-padded frames."
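For illustration, here is a minimal NumPy sketch (not code from this repo) contrasting the usual masked loss with the paper's trick of also reconstructing the zero-padded frames; the names `outputs`, `targets`, and `lengths` are hypothetical:

```python
import numpy as np

def l1_loss(outputs, targets, lengths=None):
    """L1 reconstruction loss over a zero-padded batch.

    If `lengths` is given, loss on zero-padded frames is masked out
    (the common practice). If it is None, the padded frames are also
    reconstructed, which is the trick quoted above.
    """
    diff = np.abs(outputs - targets)            # [batch, max_len, mel_dim]
    if lengths is None:
        return diff.mean()                      # reconstruct padding too
    max_len = targets.shape[1]
    mask = np.arange(max_len)[None, :] < np.array(lengths)[:, None]
    mask = mask[:, :, None].astype(diff.dtype)  # [batch, max_len, 1]
    return (diff * mask).sum() / (mask.sum() * targets.shape[2])

# Toy batch: 2 sequences padded to length 4, 3 mel channels.
# Valid frames predict 1.0; the padded tail of sequence 0 predicts 2.0.
targets = np.zeros((2, 4, 3))
outputs = np.ones((2, 4, 3))
outputs[0, 2:, :] = 2.0
print(l1_loss(outputs, targets))           # padding included in the loss
print(l1_loss(outputs, targets, [2, 4]))   # padding masked out
```

With padding included, errors on the padded tail pull the loss up, so the model is pushed to emit silence (zeros) after the utterance ends.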

This seems to be the authors' method of eliminating the echo. So, do you have any ideas on how to reconstruct the zero-padded frames? @keithito Thank you!

begeekmyfriend commented 6 years ago

One approach is to train an extra stop token target that decides when to stop decoding, rather than checking whether the time step has reached the length of the targets.
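As a rough sketch of that idea (with hypothetical names — `step_fn`, the 0.5 threshold, and the toy model are assumptions, not this repo's code): the decoder additionally predicts a scalar "stop" logit per frame, and inference halts once its sigmoid crosses a threshold instead of running to a fixed length.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_until_stop(step_fn, max_steps=1000, threshold=0.5):
    """Greedy decoding loop that halts on the predicted stop token.

    `step_fn(t)` stands in for one decoder step and returns
    (frame, stop_logit); a real model would also thread state through.
    """
    frames = []
    for t in range(max_steps):
        frame, stop_logit = step_fn(t)
        frames.append(frame)
        if sigmoid(stop_logit) > threshold:  # model says: end of speech
            break
    return np.stack(frames)

# Toy "model": emits 3-dim frames and a high stop logit on step 4.
def toy_step(t):
    return np.full(3, float(t)), (10.0 if t == 4 else -10.0)

out = decode_until_stop(toy_step)
print(out.shape)  # → (5, 3): decoding stopped after 5 frames, not 1000
```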

yyt233 commented 6 years ago

@begeekmyfriend Did you successfully solve the echo problem with this method? What are the detailed steps? Could you please explain it in detail? I tried trimming the silence at the beginning and end, and adding 1s of silence at the end of the audio, but neither worked.

begeekmyfriend commented 6 years ago

That is quite simple. We can align the stop token targets with the length of the mel targets and pad the rest with `_token_pad` to indicate the stop time step.
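One way to build such targets, as a minimal sketch (the name `_token_pad` follows the comment above; using the value 1.0 for it, and placing it from the last valid frame through the padding, are my assumptions):

```python
import numpy as np

def make_stop_targets(lengths, max_len, token_pad=1.0):
    """Stop-token targets aligned with zero-padded mel targets.

    0.0 for frames before the end of each utterance, `token_pad`
    (assumed to be 1.0 here) at the last valid frame and across the
    padded tail, so the model learns to predict "stop" exactly when
    the speech ends.
    """
    steps = np.arange(max_len)[None, :]             # [1, max_len]
    ends = np.array(lengths)[:, None] - 1           # last valid frame index
    return np.where(steps >= ends, token_pad, 0.0)  # [batch, max_len]

# Two utterances of lengths 2 and 4, padded to max_len=4:
print(make_stop_targets([2, 4], max_len=4))
# rows: [0, 1, 1, 1] and [0, 0, 0, 1]
```

During training these targets are fed to a sigmoid cross-entropy loss alongside the mel reconstruction loss; at inference the predicted stop probability decides when decoding ends.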

yyt233 commented 6 years ago

@begeekmyfriend Thank you very much! I'll have a try.