r9y9 / tacotron_pytorch

PyTorch implementation of Tacotron speech synthesis model.
http://nbviewer.jupyter.org/github/r9y9/tacotron_pytorch/blob/master/notebooks/Test%20Tacotron.ipynb

Update model #2

Closed r9y9 closed 5 years ago

r9y9 commented 7 years ago

Note to self: When I finish current experiment, I will update http://nbviewer.jupyter.org/github/r9y9/tacotron_pytorch/blob/master/notebooks/Test%20Tacotron.ipynb.

r9y9 commented 7 years ago

Updated the notebook. Sadly, the results are not as good as I expected. I will need to debug what's wrong.

rafaelvalle commented 7 years ago

@r9y9 Thanks for putting this together. Do you know why you can't reproduce the results, and how different your implementation is from the TensorFlow implementation you mention in the README?

r9y9 commented 7 years ago

@rafaelvalle No, I'm still not sure why I cannot reproduce them. I tried to implement the same architecture and the same optimization/decoding algorithms as the TensorFlow implementation, but I might be doing something wrong somewhere...

rafaelvalle commented 7 years ago

@r9y9 I'll let you know if I find anything!

rafaelvalle commented 7 years ago

I'll post questions here. If you want, I can open one issue per question. You have a pre-highway dense layer that I don't remember seeing in the paper or in a tf implementation. Why is that?

r9y9 commented 7 years ago

See https://github.com/keithito/tacotron/blob/9fc433a8702a5e8392254f4c2cc6fe035c64e424/models/modules.py#L58-L60

rafaelvalle commented 7 years ago

Looking at your CBHG, it looks like you're not concatenating the outputs of the convolution banks but feeding the output of each bank into the next bank. Please compare your code and the tf code.

Your CBHG forward pass has:

for conv1d in self.conv1d_banks:
    x = conv1d(x)[:, :, :T]  # each bank's output overwrites x and feeds the next bank

The tf code has:

for k in range(2, K+1):  # k = 2...K
    with tf.variable_scope("num_{}".format(k)):
        output = conv1d(inputs, hp.embed_size // 2, k)
        outputs = tf.concat((outputs, output), -1)
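
For what it's worth, a minimal sketch of the concatenating variant on the PyTorch side could look like the following (x, T, and self.conv1d_banks are the names from the snippet above; this is an illustration, not the exact patch):

import torch

# Run every bank on the same input and concatenate along the channel axis,
# instead of feeding one bank's output into the next bank.
outputs = [conv1d(x)[:, :, :T] for conv1d in self.conv1d_banks]
x = torch.cat(outputs, dim=1)  # (batch, K * channels, time)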

r9y9 commented 7 years ago

Oops, my bad! Thank you very much for pointing this out. I will fix this and rerun the experiments! I will let you know once I have a fix.

r9y9 commented 7 years ago

Okay, I think I have a fix.

rafaelvalle commented 7 years ago

Looks right to me! Will you post something here after you rerun the experiments?

r9y9 commented 7 years ago

Sure, but my GPU is busy right now, so it will take some time to rerun the experiments.

r9y9 commented 7 years ago

ah, I didn't mean to close this.

rafaelvalle commented 7 years ago

Your attention wrapper seems to be different from the tf implementation.

You concatenate the query and the attention before feeding it into the AttentionRNN whereas in the tf implementation the AttentionRNN only uses the output of the prenet, i.e. the query. Unless tf's attention wrapper does it internally, which I think is the case!

r9y9 commented 7 years ago

You concatenate the query and the attention before feeding it into the AttentionRNN whereas in the tf implementation the AttentionRNN only uses the output of the prenet, i.e. the query. Unless tf's attention wrapper does it internally, which I think is the case!

TensorFlow's AttentionWrapper actually does this internally. https://github.com/tensorflow/tensorflow/blob/624bcfe409601910951789325f0b97f520c0b1ee/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py#L1236-L1238
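
In PyTorch terms, that internal step amounts to roughly the following (prenet_out, prev_attention, attention_rnn, and attention_rnn_hidden are illustrative names, not the repo's exact identifiers):

import torch

# TF's AttentionWrapper concatenates the cell input (the prenet/query output)
# with the previous attention context before calling the wrapped RNN cell.
cell_input = torch.cat((prenet_out, prev_attention), dim=-1)
attention_rnn_hidden = attention_rnn(cell_input, attention_rnn_hidden)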

r9y9 commented 7 years ago

TF's seq2seq APIs are very high-level, and it's often hard to tell what they are really doing...

rafaelvalle commented 7 years ago

What data did you use and where did you download it from?

r9y9 commented 7 years ago

A single speaker's 24 hours of audio data from https://keithito.com/LJ-Speech-Dataset/.

I use https://github.com/keithito/tacotron#training for data preprocessing.

rafaelvalle commented 7 years ago

During your experiments, was it necessary to clip the norm of the gradients?

r9y9 commented 7 years ago

Well, I didn't try turning off gradient clipping, so I'm not sure whether it's actually necessary.
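
(For reference, the clipping step in PyTorch is a one-liner between the backward pass and the optimizer step; the threshold of 1.0 below is just an illustrative value, and loss, model, and optimizer are the usual training-loop names.)

loss.backward()
# Rescale all gradients so that their global norm does not exceed the threshold.
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()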

r9y9 commented 7 years ago

Progress: I'm training a new model with the latest code, now at step 600000. Will update the notebook in a few days.

rafaelvalle commented 7 years ago

What do the results look like?

r9y9 commented 7 years ago

Updated: http://nbviewer.jupyter.org/github/r9y9/tacotron_pytorch/blob/f98eda7336726cdfe4ab97ae867cc7f71353de50/notebooks/Test%20Tacotron.ipynb (the link includes git commit hash)

I felt it became a little bit better than before. One interesting thing I noticed is that my implementation can reasonably synthesize long inputs, while keithito/tacotron cannot. Some words will often be skipped, though. Ref: https://github.com/keithito/tacotron/pull/43#issuecomment-332068107

rafaelvalle commented 7 years ago

Thanks for sharing this. Did you see any change in the attention mechanism? And do you think that the difference from keithito's implementation comes from weight initialization and other parameter defaults?

r9y9 commented 7 years ago

No significant differences from my previous experiment. However, it seems somewhat different from keithito's. I attached the learned alignments for comparison:

[Attached alignment plots: tacotron-tf-alignment_47000steps, tacotron-tf-monotonic-alignment_47000steps, tacotron-alignment_47000steps]

And do you think that the difference from keithito's implementation comes from weight initialization and other parameter defaults?

Sorry, I'm not sure. I guess there's something wrong in the model architecture. As for the weight initialization, I tried a few schemes for the embedding but got no significant differences. See https://github.com/r9y9/tacotron_pytorch/blob/f98eda7336726cdfe4ab97ae867cc7f71353de50/tacotron_pytorch/tacotron.py#L288-L289.

rafaelvalle commented 7 years ago

That is very interesting. Your implementation learns the attention for that sentence much faster and in a cleaner manner than keithito's implementation.

qbx2 commented 6 years ago

Why don't you try explicit initialization of the modules' weights and biases? TF's default initializer is the Xavier initializer (https://stackoverflow.com/questions/37350131/what-is-the-default-variable-initializer-in-tensorflow), which is different from PyTorch's.
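
A minimal sketch of what that could look like, assuming model is the nn.Module holding the Tacotron graph (illustrative, not code from either repo):

import torch.nn as nn

def xavier_init(module):
    # Xavier/Glorot uniform init for weights; biases reset to zero.
    if isinstance(module, (nn.Linear, nn.Conv1d)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model.apply(xavier_init)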

qbx2 commented 6 years ago

https://github.com/keithito/tacotron/blob/6612fd3ef6760ffe6f7cce93e00fa8421b98e9b3/models/helpers.py#L35 The difference in the EOS decision code may be significant, too. (Maybe eps=0.2 is too large in is_end_of_frames?)

Edit: Well, this doesn't affect the training phase.

r9y9 commented 6 years ago

I know there are some differences in weight initialization between PyTorch and TensorFlow, but I just haven't tried Xavier/truncated normal yet.

eps=0.2 for the EOS decision works in practice, though there should be a more robust and clever approach. As far as I know, Deep Voice 3 predicts an EOS binary label during inference. https://arxiv.org/abs/1710.07654
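
For context, the stop condition in question is essentially a threshold test on the predicted frame, roughly along these lines (a sketch of the idea, not a verbatim copy of either repo):

def is_end_of_frames(output, eps=0.2):
    # Stop decoding once every value of the predicted frame falls below eps,
    # i.e. the model is emitting near-silence.
    return (output <= eps).all()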

qbx2 commented 6 years ago

https://github.com/keithito/tacotron/blob/6612fd3ef6760ffe6f7cce93e00fa8421b98e9b3/models/tacotron.py#L60 https://github.com/Kyubyong/tacotron_asr/blob/master/networks.py#L81

I think that these tensorflow implementations use (proj + h_1 + h_2) as input of mel_proj.

# Concat RNN output and attention context vector
decoder_input = self.project_to_decoder_in(
    torch.cat((attention_rnn_hidden, current_attention), -1))

# Pass through the decoder RNNs
for idx in range(len(self.decoder_rnns)):
    decoder_rnn_hiddens[idx] = self.decoder_rnns[idx](
        decoder_input, decoder_rnn_hiddens[idx])
    # Residual connection
    decoder_input = decoder_rnn_hiddens[idx] + decoder_input

# Last decoder hidden state is the output vector
output = decoder_rnn_hiddens[-1]
output = self.proj_to_mel(output)

However, it seems that yours uses h_2 from rnn_2(proj + h_1) as above. So I think the fixed code would look like this (output = decoder_input):

# Concat RNN output and attention context vector
decoder_input = self.project_to_decoder_in(
    torch.cat((attention_rnn_hidden, current_attention), -1))

# Pass through the decoder RNNs
for idx in range(len(self.decoder_rnns)):
    decoder_rnn_hiddens[idx] = self.decoder_rnns[idx](
        decoder_input, decoder_rnn_hiddens[idx])

    # Residual connection
    decoder_input = decoder_input + decoder_rnn_hiddens[idx]

# Last decoder hidden state is the output vector
output = decoder_input
output = self.proj_to_mel(output)

r9y9 commented 6 years ago

Thank you @qbx2, you are right. Fixed by https://github.com/r9y9/tacotron_pytorch/commit/059272508e4bd24e8ba6028e02a247c6cbc4e1b3.

qbx2 commented 6 years ago

Nice. I'd love to see improved results after the commit.

qbx2 commented 6 years ago

Oh, and I'd like to suggest passing bias=False to the Conv1d inside BatchNormConv1d here for slightly better computation speed. The bias is not needed because the following BatchNorm1d layer adds its own shift.

        self.conv1d = nn.Conv1d(in_dim, out_dim,
                                kernel_size=kernel_size,
                                stride=stride, padding=padding)

https://github.com/r9y9/tacotron_pytorch/blob/master/tacotron_pytorch/tacotron.py#L32
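
The suggested change would look roughly like this (same constructor arguments as above, just dropping the redundant conv bias since the following BatchNorm1d supplies its own shift):

        self.conv1d = nn.Conv1d(in_dim, out_dim,
                                kernel_size=kernel_size,
                                stride=stride, padding=padding,
                                bias=False)  # BatchNorm1d adds the shift instead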

r9y9 commented 6 years ago

Fixed https://github.com/r9y9/tacotron_pytorch/commit/e56cdab2a8ef221ba685f8908d63d30abe1d5abc

rafaelvalle commented 6 years ago

FYI: PyTorch does not have truncated-normal weight initialization.
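
(A crude workaround, if anyone wants to approximate TF's truncated normal: sample normally and resample values outside two standard deviations. The helper below is just an illustration, not part of either codebase.)

import torch

def truncated_normal_(tensor, mean=0.0, std=1.0):
    # Fill with N(mean, std) samples, then resample entries lying more than
    # two standard deviations from the mean, mimicking TF's truncated_normal.
    with torch.no_grad():
        tensor.normal_(mean, std)
        invalid = (tensor - mean).abs() > 2 * std
        while invalid.any():
            tensor[invalid] = torch.empty_like(tensor[invalid]).normal_(mean, std)
            invalid = (tensor - mean).abs() > 2 * std
    return tensor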

r9y9 commented 6 years ago

I knew that, but haven't tried it because:

1. it doesn't exist in PyTorch;
2. I don't think it's crucial.

However, I just had an experience where initialization was quite important for training deep models. So now I think it's definitely worth trying.

zhbbupt commented 6 years ago

I'm trying to use BahdanauMonotonicAttention, but I can't get the same result as keithito's (Bahdanau-style) monotonic attention. Could anybody take a look? https://github.com/r9y9/tacotron_pytorch/issues/8 @r9y9 @rafaelvalle @qbx2

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.