keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

Start / stop symbols #161

Open jlin816 opened 6 years ago

jlin816 commented 6 years ago

I'm trying to debug echoing and overly long output audio and, in the spirit of #31 and #133, seeing whether I can fix things by playing around with the start and stop symbols. The TensorFlow seq2seq documentation is sparse, and I can't tell from https://github.com/keithito/tacotron/blob/master/models/helpers.py what exactly is going on with start and stop symbols in testing and training. If I'm using my own dataset separate from blizzard/ljspeech, does the model expect:

Thanks in advance!

keithito commented 6 years ago

Hi! Yes, you shouldn't have to worry about adding any of the special symbols -- datafeeder.py will take care of padding the inputs and adding the end token, and the helper takes care of generating the start frame. Your input can just be the spectrograms and text.
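For readers wondering what "padding the inputs and adding the end token" amounts to, here is a minimal pure-Python sketch. It is not the code from datafeeder.py or helpers.py: the ids `PAD_ID` and `EOS_ID`, the function names, and the all-zeros start frame are assumptions made for illustration.

```python
# Hypothetical ids, chosen for illustration only.
PAD_ID = 0  # assumed id used to pad short sequences
EOS_ID = 1  # assumed id of the end-of-sequence symbol


def append_eos(ids):
    """Return a copy of the symbol ids with the end token appended."""
    return list(ids) + [EOS_ID]


def pad_batch(seqs):
    """Right-pad every sequence with PAD_ID to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [PAD_ID] * (max_len - len(s)) for s in seqs]


def go_frame(num_mels):
    """Start frame fed to the decoder at step 0 (assumed here to be all zeros)."""
    return [0.0] * num_mels


batch = pad_batch([append_eos([5, 6, 7]), append_eos([8])])
```

With this kind of preprocessing in place, the caller only supplies raw text ids and spectrograms; the special symbols are added downstream.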

begeekmyfriend commented 6 years ago

Have you considered a minimal modification in helpers.py that trains the decoder to stop once it reaches the padding in the targets?

finished = tf.reduce_all(tf.equal(self._targets[:, time], [_pad]), axis=1)

jlin816 commented 6 years ago

@begeekmyfriend Can you elaborate?

begeekmyfriend commented 6 years ago

Well, I have been looking at Rayhane's version (https://github.com/Rayhane-mamah/Tacotron-2/issues/46), in which decoding does not stop until a stop token is predicted, rather than when the full length is reached. It trains stop token targets so the model learns when to stop decoding, and it works for synthesized audio. I want to find out whether we need to train these extra stop token targets or not.
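To make "stop token targets" concrete, here is a small sketch of how such targets could be built for a padded batch. It is an illustration under assumptions, not the code from either repository: the function names, the convention that the target is 1 from the last real frame onward, and the 0.5 threshold are all hypothetical.

```python
def stop_token_targets(lengths, max_len):
    """Build per-frame stop targets for a padded batch.

    Assumed convention: 0.0 while the utterance continues, 1.0 from the
    last real frame onward (including all padded frames).
    """
    return [
        [1.0 if t >= n - 1 else 0.0 for t in range(max_len)]
        for n in lengths
    ]


def should_stop(stop_probability, threshold=0.5):
    """Decode-time check: stop once the predicted stop probability is high enough."""
    return stop_probability > threshold


targets = stop_token_targets([2, 4], max_len=4)
```

The model would then be trained to predict these targets alongside the spectrogram frames, and synthesis would end at the first step where `should_stop` returns True.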

PengjuYan commented 6 years ago

Hi Keith, is the determination of "finish" too rigid?

finished = tf.reduce_all(tf.equal(outputs, self._end_token), axis=1)

I'm wondering whether it would be enough to finish when one of the outputs is close enough to self._end_token, rather than requiring the outputs to equal all 0s exactly.
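The difference between the exact check and a relaxed one can be sketched in plain Python for a single decoder step. This is an illustration of the idea only, not the helper's code or the contents of any PR: the function names, the tolerance of 0.2, and the assumption that one decoder output holds several concatenated frames of `num_mels` values each are all hypothetical.

```python
def split_frames(output, num_mels):
    """Split one decoder output into its individual frames of num_mels values."""
    return [output[i:i + num_mels] for i in range(0, len(output), num_mels)]


def is_close_to_end(frame, end_value=0.0, tol=0.2):
    """True if every value in the frame is within tol of the end value."""
    return all(abs(x - end_value) <= tol for x in frame)


def finished_exact(output, end_value=0.0):
    """Strict check: every value must equal the end value exactly."""
    return all(x == end_value for x in output)


def finished_relaxed(output, num_mels, end_value=0.0, tol=0.2):
    """Relaxed check: stop if any one frame is close enough to the end value."""
    return any(
        is_close_to_end(frame, end_value, tol)
        for frame in split_frames(output, num_mels)
    )


step_output = [0.5, 0.6, 0.05, -0.1]  # two frames of two values each
```

For `step_output`, the strict check does not fire because no value is exactly 0, while the relaxed check does because the second frame lies within the tolerance. The tolerance trades early cut-offs against overly long audio, so it would need tuning on real outputs.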

begeekmyfriend commented 6 years ago

@PengjuYan I have opened a PR that includes this idea: https://github.com/keithito/tacotron/pull/204/commits/3ca25ef1b2cbf8f2470c55c923d093dcde19434f

PengjuYan commented 6 years ago

@begeekmyfriend Can you share with us whether it produces audio with higher quality?

begeekmyfriend commented 6 years ago

Only 5 hours and 66K steps. eval-66000-linear.zip