keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

The dim problem in stop token prediction in @begeekmyfriend's fork #285

Closed · xus-stack closed this issue 5 years ago

xus-stack commented 5 years ago

In datafeeder.py, line 117:

`stop_token_target = np.asarray([0.] * len(mel_target))`

Apparently the shape of `stop_token_target` here is [M,]. It becomes [N, M] in the batch, maybe?

But in tacotron.py, line 82:

`stop_token_outputs = tf.reshape(stop_token_outputs, [batch_size, -1])  # [N, T_out, M]`

and lines 116~118:

`self.stop_token_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=self.stop_token_targets, logits=self.stop_token_outputs))`

And in rnn_wrappers.py, the stop token output is apparently a scalar for each decoder step. Isn't the output shaped [N, T_out, 1]?

How do the dimensions of the stop token target and output match in the code? Can somebody explain this? Thanks! @begeekmyfriend
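
For concreteness, a minimal NumPy sketch of the target shapes in question (only the `stop_token_target` line is from the fork; the other values are made up for illustration):

```python
import numpy as np

# One utterance: a mel spectrogram with T_out frames (80 mel channels assumed).
T_out = 100
mel_target = np.zeros((T_out, 80))

# datafeeder.py line 117: one scalar target per output frame.
stop_token_target = np.asarray([0.] * len(mel_target))
print(stop_token_target.shape)  # (100,)  i.e. [T_out,] per utterance

# After batching (all utterances padded to a common length),
# the targets stack into [N, T_out]:
N = 32
batched = np.stack([stop_token_target] * N)
print(batched.shape)            # (32, 100)
```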

xus-stack commented 5 years ago

Sorry, I got it. The size of `stop_token_target` is [T_out,]; apparently I was misled by the shape notation in line 82. The size of `stop_token_outputs` should be [N, T_out, 1], and the reshape in line 82 flattens it to [N, T_out], which matches the batched targets. This issue can be closed.
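
To spell out why the shapes line up, a NumPy sketch mirroring the TensorFlow ops (tensor names follow the issue; the random logits are placeholders):

```python
import numpy as np

N, T_out = 32, 100

# Each decoder step emits a single stop-token logit per utterance, so the
# collected outputs have a trailing singleton dimension: [N, T_out, 1].
stop_token_outputs = np.random.randn(N, T_out, 1)

# tf.reshape(stop_token_outputs, [batch_size, -1]) therefore just drops
# that singleton dimension, giving [N, T_out]:
logits = stop_token_outputs.reshape(N, -1)
print(logits.shape)             # (32, 100)

# This matches the batched targets, so the elementwise sigmoid cross
# entropy in tacotron.py lines 116~118 is well defined. The stable form
# below is what tf.nn.sigmoid_cross_entropy_with_logits computes:
targets = np.zeros((N, T_out))
loss = np.mean(np.maximum(logits, 0) - logits * targets
               + np.log1p(np.exp(-np.abs(logits))))
print(loss)
```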

xus-stack commented 5 years ago

But another difference concerns me. In @begeekmyfriend's fork, datafeeder.py, line 117:

`stop_token_target = np.asarray([0.] * len(mel_target))`

In Tacotron-2, Tacotron-2/tacotron/feeder.py, line 194:

`token_target = np.asarray([0.] * (len(mel_target) - 1))`

This difference is notable. Might there be a problem here?

I believe a 1 should be added as the last frame of the target.
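
Side by side, the two constructions differ by exactly one frame (shapes only; the file and line references are from the comments above):

```python
import numpy as np

mel_target = np.zeros((100, 80))  # one utterance with 100 frames

# @begeekmyfriend's fork, datafeeder.py line 117:
fork_target = np.asarray([0.] * len(mel_target))
print(fork_target.shape)          # (100,) -- a 0 for every frame

# Tacotron-2, feeder.py line 194:
taco2_target = np.asarray([0.] * (len(mel_target) - 1))
print(taco2_target.shape)         # (99,)  -- one frame short, on purpose
```

Neither array contains a 1 yet; where the 1 ends up depends on how the feeder pads the targets when batching, which is what the next comment refers to.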

xus-stack commented 5 years ago

I see, that's a little different from Taco 2, but it's taken care of in feeder.py.
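
A sketch of the padding step that "takes care" of it, assuming the feeder pads stop-token targets with a constant 1.0 out to the padded mel length, as Tacotron-2's feeder appears to do (the helper name `_pad_stop_token_target` is illustrative, not necessarily the fork's actual function):

```python
import numpy as np

def _pad_stop_token_target(t, length):
    # Illustrative helper (assumption): pad with 1.0, the "stop" value.
    return np.pad(t, (0, length - t.shape[0]),
                  mode='constant', constant_values=1.)

padded_len = 104  # padded mel length for the batch (e.g. a multiple of r)

# Tacotron-2 style: 99 zeros for a 100-frame utterance, so the 1s start
# at the last real frame (index 99).
taco2 = _pad_stop_token_target(np.zeros(99), padded_len)
print(np.argmax(taco2))  # 99

# Fork style: 100 zeros, so the first 1 lands one frame later (index 100),
# i.e. in the padded region just past the last real frame.
fork = _pad_stop_token_target(np.zeros(100), padded_len)
print(np.argmax(fork))   # 100
```

Under this assumption the off-by-one only shifts where the first 1 appears by a single frame; both variants still mark the end of the utterance once padded, which is presumably what "taken care of in feeder.py" means.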