Closed r9y9 closed 5 years ago
Updated the notebook. Sadly, the results are not as good as I expected. I will need to debug what's wrong.
@r9y9 Thanks for putting this together. Do you know why you can't reproduce the results, and how different is your implementation from the tensorflow implementation you mentioned in the README?
@rafaelvalle No, I'm still not sure why I cannot reproduce the results. I tried to implement the same architecture and the same optimization/decoding algorithms as the tensorflow implementation, but I might be doing something wrong somewhere...
@r9y9 I'll let you know if I find anything!
I'll post questions here; if you want, I can open one issue per question. You have a pre-highway dense layer that I don't remember seeing in the paper or in a tf implementation. Why is that?
Looking at your CBHG, it looks like you're not concatenating the outputs of the convolution banks but feeding the output of each bank into the next bank. Please compare your code with the tf code.
Your CBHG forward pass has:
for conv1d in self.conv1d_banks:
    x = conv1d(x)[:, :, :T]
The tf code has:
for k in range(2, K+1):  # k = 2...K
    with tf.variable_scope("num_{}".format(k)):
        output = conv1d(inputs, hp.embed_size // 2, k)
        outputs = tf.concat((outputs, output), -1)
Oops, my bad! Thank you very much for pointing this out. I will fix this and rerun the experiments! I will let you know when I have a fix.
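For reference, a minimal PyTorch sketch of the concatenating pattern, with hypothetical dimensions and bank count rather than the repo's actual hyperparameters:

```python
import torch
import torch.nn as nn

# Sketch: each bank sees the ORIGINAL input x, and the bank outputs are
# concatenated along the channel axis (instead of chaining bank into bank).
class ConvBanks(nn.Module):
    def __init__(self, in_dim=128, K=4):
        super().__init__()
        self.banks = nn.ModuleList(
            [nn.Conv1d(in_dim, in_dim, kernel_size=k, padding=k // 2)
             for k in range(1, K + 1)])

    def forward(self, x):
        T = x.size(-1)
        # Slice to T since even kernel sizes pad one extra frame
        outs = [conv(x)[:, :, :T] for conv in self.banks]
        return torch.cat(outs, dim=1)

x = torch.randn(2, 128, 30)   # (batch, channels, time)
y = ConvBanks()(x)
print(y.shape)                # torch.Size([2, 512, 30]) with K=4 banks
```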
Okay, I think I have a fix.
Looks right to me! Will you post something here after you rerun the experiments?
Sure, but my GPU is busy right now so it takes time to rerun the experiments.
Ah, I didn't mean to close this.
Your attention wrapper seems to be different from the tf implementation.
You concatenate the query and the attention before feeding it into the AttentionRNN whereas in the tf implementation the AttentionRNN only uses the output of the prenet, i.e. the query. Unless tf's attention wrapper does it internally, which I think is the case!
TensorFlow's AttentionWrapper actually does this internally. https://github.com/tensorflow/tensorflow/blob/624bcfe409601910951789325f0b97f520c0b1ee/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py#L1236-L1238
TF's seq2seq APIs are very high-level, and it's often hard to tell what they actually do..
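A small sketch of what that internal concatenation amounts to, with hypothetical dimensions: the wrapper feeds the cell the concatenation of the current input (the prenet query) and the previous step's attention context.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the point is that the RNN cell input is
# [query; previous attention context], mirroring TF's AttentionWrapper.
query_dim, context_dim, hidden_dim = 128, 256, 256
attention_rnn = nn.GRUCell(query_dim + context_dim, hidden_dim)

query = torch.randn(2, query_dim)            # prenet output at this step
prev_context = torch.randn(2, context_dim)   # attention from previous step
cell_input = torch.cat((query, prev_context), dim=-1)
hidden = attention_rnn(cell_input)           # hidden state defaults to zeros
print(hidden.shape)                          # torch.Size([2, 256])
```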
What data did you use and where did you download it from?
A single speaker's 24 hours of audio data from https://keithito.com/LJ-Speech-Dataset/.
I use https://github.com/keithito/tacotron#training for data preprocessing.
During your experiments, was it necessary to clip the norm of the gradients?
Well, I didn't try turning off gradient clipping, so I'm not sure if it's actually necessary.
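For anyone trying it, gradient-norm clipping in PyTorch is a one-liner between backward() and the optimizer step; the toy model below is illustrative, not the repo's training loop:

```python
import torch
import torch.nn as nn

# Toy model and optimizer, for illustration only
model = nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
# Rescale all gradients so their global L2 norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```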
Progress: I'm training a new model with the latest code, now at step 600000. Will update the notebook in a few days.
How do the results look?
Updated: http://nbviewer.jupyter.org/github/r9y9/tacotron_pytorch/blob/f98eda7336726cdfe4ab97ae867cc7f71353de50/notebooks/Test%20Tacotron.ipynb (the link includes git commit hash)
I feel it became a little better than before. One interesting thing I noticed is that my implementation can reasonably synthesize long inputs, while keithito/tacotron cannot. Some words will often be skipped, though. Ref: https://github.com/keithito/tacotron/pull/43#issuecomment-332068107
Thanks for sharing this. Did you see any change in the attention mechanism? And do you think that the difference from keithito's implementation comes from weight initialization and other parameter defaults?
No significant differences from my previous experiment. However, it seems somewhat different from keithito's. I attached learned alignments for
And do you think that the difference from keithito's implementation comes from weight initialization and other parameter defaults?
Sorry, not sure. I guess there's something wrong in model architecture. As for the weight initialization, I tried a few for embedding but got no significant differences. See https://github.com/r9y9/tacotron_pytorch/blob/f98eda7336726cdfe4ab97ae867cc7f71353de50/tacotron_pytorch/tacotron.py#L288-L289.
That is very interesting. Your implementation learns the attention for that sentence much faster and more cleanly than keithito's implementation.
Why don't you try initializing the modules' weights & biases? TF's default initializer is the Xavier initializer (https://stackoverflow.com/questions/37350131/what-is-the-default-variable-initializer-in-tensorflow), which is different from pytorch's.
https://github.com/keithito/tacotron/blob/6612fd3ef6760ffe6f7cce93e00fa8421b98e9b3/models/helpers.py#L35 The difference in the EOS decision code may be significant, too. (Maybe eps=0.2 is too large in is_end_of_frames?)
Edit: Well, this doesn't affect the training phase.
I know there are some differences in weight initialization between pytorch and tensorflow, but I just haven't tried Xavier/truncated normal yet.
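A sketch of applying Xavier (Glorot) uniform initialization in PyTorch to mimic TF's default; the module types covered and the zeroed biases are my assumptions, not the repo's code:

```python
import torch.nn as nn

# Apply Xavier uniform init to Linear and Conv1d weights; zero the biases
def init_weights(m):
    if isinstance(m, (nn.Linear, nn.Conv1d)):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model = nn.Sequential(nn.Linear(8, 16), nn.Linear(16, 4))
model.apply(init_weights)
```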
eps=0.2 for the EOS decision works in practice, though there should be a more robust and clever approach. As far as I know, Deep Voice 3 predicts an EOS binary label during inference. https://arxiv.org/abs/1710.07654
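For illustration, an amplitude-threshold EOS check of this kind can be sketched as follows; the exact comparison and the eps value are assumptions rather than the repo's verbatim code:

```python
import torch

# Hedged sketch: stop decoding once every value in the predicted frame
# falls below eps (i.e. the frame is effectively silent).
def is_end_of_frames(frame: torch.Tensor, eps: float = 0.2) -> bool:
    return bool((frame <= eps).all())

print(is_end_of_frames(torch.full((80,), 0.1)))  # True: silent frame
print(is_end_of_frames(torch.full((80,), 0.5)))  # False: still active
```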
https://github.com/keithito/tacotron/blob/6612fd3ef6760ffe6f7cce93e00fa8421b98e9b3/models/tacotron.py#L60 https://github.com/Kyubyong/tacotron_asr/blob/master/networks.py#L81
I think these tensorflow implementations use (proj + h_1 + h_2) as the input of mel_proj.
# Concat RNN output and attention context vector
decoder_input = self.project_to_decoder_in(
    torch.cat((attention_rnn_hidden, current_attention), -1))

# Pass through the decoder RNNs
for idx in range(len(self.decoder_rnns)):
    decoder_rnn_hiddens[idx] = self.decoder_rnns[idx](
        decoder_input, decoder_rnn_hiddens[idx])
    # Residual connection
    decoder_input = decoder_rnn_hiddens[idx] + decoder_input

# Last decoder hidden state is the output vector
output = decoder_rnn_hiddens[-1]
output = self.proj_to_mel(output)
However, it seems that yours uses h_2 from rnn_2(proj + h_1) as above. So I think the fixed code would be like this (output = decoder_input):
# Concat RNN output and attention context vector
decoder_input = self.project_to_decoder_in(
    torch.cat((attention_rnn_hidden, current_attention), -1))

# Pass through the decoder RNNs
for idx in range(len(self.decoder_rnns)):
    decoder_rnn_hiddens[idx] = self.decoder_rnns[idx](
        decoder_input, decoder_rnn_hiddens[idx])
    # Residual connection
    decoder_input = decoder_input + decoder_rnn_hiddens[idx]

# The accumulated residual sum is the output vector
output = decoder_input
output = self.proj_to_mel(output)
Thank you @qbx2, you are right. Fixed by https://github.com/r9y9/tacotron_pytorch/commit/059272508e4bd24e8ba6028e02a247c6cbc4e1b3.
Nice. I'd love to see improved results after the commit.
Oh, and I'd like to suggest passing bias=False to the Conv1d in BatchNormConv1d here for slightly faster computation. The bias is not needed because the BatchNorm1d layer adds one itself.
self.conv1d = nn.Conv1d(in_dim, out_dim,
                        kernel_size=kernel_size,
                        stride=stride, padding=padding)
https://github.com/r9y9/tacotron_pytorch/blob/master/tacotron_pytorch/tacotron.py#L32
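The suggested change could look like this minimal sketch (class body simplified; activation and other details omitted):

```python
import torch
import torch.nn as nn

# Sketch: drop the Conv1d bias since the following BatchNorm1d
# already learns an additive shift (beta).
class BatchNormConv1d(nn.Module):
    def __init__(self, in_dim, out_dim, kernel_size, stride, padding):
        super().__init__()
        self.conv1d = nn.Conv1d(in_dim, out_dim, kernel_size=kernel_size,
                                stride=stride, padding=padding, bias=False)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, x):
        return self.bn(self.conv1d(x))

m = BatchNormConv1d(80, 128, kernel_size=3, stride=1, padding=1)
y = m(torch.randn(2, 80, 40))
print(y.shape)  # torch.Size([2, 128, 40])
```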
FYI: pytorch does not have truncated weight initialization
I knew it, but haven't tried it because:
1. it doesn't exist in PyTorch.
However, I just had an experience where initialization was quite important for learning deep models. So now I think it's definitely worth trying.
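Since PyTorch lacked a built-in truncated normal at the time, one can sketch it with rejection sampling, resampling values beyond two standard deviations (the helper name and the two-sigma bounds are my own choices):

```python
import torch

def truncated_normal_(tensor, mean=0.0, std=1.0):
    # Fill with N(mean, std), then resample entries that fall outside
    # [mean - 2*std, mean + 2*std] until all values are in range.
    with torch.no_grad():
        tensor.normal_(mean, std)
        while True:
            invalid = (tensor < mean - 2 * std) | (tensor > mean + 2 * std)
            if not invalid.any():
                return tensor
            tensor[invalid] = torch.empty(int(invalid.sum())).normal_(mean, std)

w = truncated_normal_(torch.empty(256, 128), std=0.02)
print(bool(w.abs().max() <= 2 * 0.02))  # True: all within two std devs
```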
I'm trying to use BahdanauMonotonicAttention, but I can't get the same result as keithito's (Bahdanau-style) monotonic attention. Could anybody take a look? https://github.com/r9y9/tacotron_pytorch/issues/8 @r9y9 @rafaelvalle @qbx2
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Note to self: When I finish current experiment, I will update http://nbviewer.jupyter.org/github/r9y9/tacotron_pytorch/blob/master/notebooks/Test%20Tacotron.ipynb.