Open jlin816 opened 6 years ago
Hi! Yes, you shouldn't have to worry about adding any of the special symbols -- datafeeder.py will take care of padding the inputs and adding the end token, and the helper takes care of generating the start frame. Your input can just be the spectrograms and text.
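To make the above concrete, here is a minimal sketch in plain Python of what "adding the end token and padding" and "generating the start frame" can look like. It is not the code from datafeeder.py or helpers.py: the function names and the symbol IDs `PAD_ID` and `EOS_ID` are hypothetical, and the all-zero start frame is an assumption made for illustration.

```python
# Hypothetical symbol IDs, chosen only for this illustration.
PAD_ID = 0
EOS_ID = 1


def prepare_text_batch(sequences, pad_id=PAD_ID, eos_id=EOS_ID):
    """Append an end token to each sequence, then right-pad to the longest one."""
    with_eos = [list(seq) + [eos_id] for seq in sequences]
    max_len = max(len(seq) for seq in with_eos)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in with_eos]


def go_frame(num_mels):
    """Return an all-zero frame to feed the decoder at the first step (assumed)."""
    return [0.0] * num_mels


batch = prepare_text_batch([[5, 6, 7], [8, 9]])
print(batch)  # [[5, 6, 7, 1], [8, 9, 1, 0]]
```

The point is only that the caller passes raw symbol IDs and spectrograms; the end token, padding and start frame are added by the pipeline.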
Have you considered a minimal modification in helpers.py so that the decoder is trained to stop once the targets reach their padding?
finished = tf.reduce_all(tf.equal(self._targets[:, time], [_pad]), axis=1)
@begeekmyfriend Can you elaborate?
Well, I was referring to Rayhane's version (https://github.com/Rayhane-mamah/Tacotron-2/issues/46), in which decoding does not stop until the stop token is produced, rather than when the full target length is reached. It trains on stop token targets to learn when to stop decoding, and it works for synthesized audio. I want to find out whether we need to train the extra stop token targets or not.
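For readers unfamiliar with the idea, the sketch below shows one way stop token targets can be derived from the unpadded frame counts, and how a predicted stop probability could end decoding. It is an illustration in plain Python, not code from either repository: `stop_token_targets`, `first_stop_step` and the 0.5 threshold are hypothetical, and marking every padded frame as 1 is just one possible convention.

```python
def stop_token_targets(lengths, max_len=None):
    """Build one row per utterance: 0.0 for real frames, 1.0 for padded frames."""
    if max_len is None:
        max_len = max(lengths)
    return [[0.0 if t < n else 1.0 for t in range(max_len)] for n in lengths]


def first_stop_step(stop_probs, threshold=0.5):
    """Return the first step whose stop probability exceeds the threshold.

    Falls back to the full length if no step crosses the threshold.
    """
    for t, p in enumerate(stop_probs):
        if p > threshold:
            return t
    return len(stop_probs)


targets = stop_token_targets([3, 5])
print(targets)  # [[0.0, 0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0, 0.0]]
print(first_stop_step([0.01, 0.02, 0.10, 0.93, 0.99]))  # 3
```

At synthesis time there are no targets, so a learned stop signal like this is what lets the decoder end early instead of running to the maximum length.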
Hi Keith, is the determination of "finish" too rigid?
finished = tf.reduce_all(tf.equal(outputs, self._end_token), axis=1)
I'm wondering whether it would be enough to treat decoding as finished when one of the outputs is close enough to self._end_token, rather than requiring it to equal exactly all 0s.
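A minimal sketch of the difference between the two checks, in plain Python rather than TensorFlow. The names `is_end_frame_exact` and `is_end_frame_close` and the tolerance `tol` are hypothetical, and the end token is assumed to be an all-zero frame as in the line quoted above.

```python
def is_end_frame_exact(frame, end_value=0.0):
    """Strict check: every element must equal the end value exactly."""
    return all(x == end_value for x in frame)


def is_end_frame_close(frame, end_value=0.0, tol=1e-2):
    """Relaxed check: every element must lie within tol of the end value."""
    return all(abs(x - end_value) <= tol for x in frame)


nearly_silent = [0.003, -0.004, 0.0]
print(is_end_frame_exact(nearly_silent))  # False
print(is_end_frame_close(nearly_silent))  # True
```

A network output rarely lands on exactly 0.0 in every dimension, which is why the strict check may never fire at synthesis time. The tolerance would need tuning so that quiet but valid frames are not mistaken for the end.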
@PengjuYan I have opened a PR that includes this idea: https://github.com/keithito/tacotron/pull/204/commits/3ca25ef1b2cbf8f2470c55c923d093dcde19434f
@begeekmyfriend Can you share with us whether it produces audio with higher quality?
Only 5 hours and 66K steps. eval-66000-linear.zip
I'm trying to debug echoing and overly long output audio and, in the spirit of #31 and #133, seeing if I can fix things by playing around with the start and stop symbols. The TensorFlow seq2seq documentation is sparse, and I can't tell from https://github.com/keithito/tacotron/blob/master/models/helpers.py what exactly is going on with start and stop symbols in testing and training. If I'm using my own dataset separate from blizzard/ljspeech, does the model expect:
initialize function seems like it might be doing something like this?), or am I supposed to be providing certain things in the data? Is this different for training and testing?
Thanks in advance!