syang1993 / gst-tacotron

A TensorFlow implementation of "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis"

Train as a Tacotron1 script problem #15


dazenhom commented 6 years ago

Thanks for your great work, but I found that if I set the hyperparameter use_gst=False and run training, the behavior differs from my understanding of Tacotron1. The relevant part of tacotron.py is here:

      if reference_mel is not None:
        # Reference encoder
        refnet_outputs = reference_encoder(
          reference_mel, 
          filters=hp.reference_filters, 
          kernel_size=(3,3),
          strides=(2,2),
          encoder_cell=GRUCell(hp.reference_depth),
          is_training=is_training)                                                 # [N, 128]
        self.refnet_outputs = refnet_outputs                                       

        if hp.use_gst:
          # Style attention
          style_attention = MultiheadAttention(
            tf.expand_dims(refnet_outputs, axis=1),                                   # [N, 1, 128]
            tf.tanh(tf.tile(tf.expand_dims(gst_tokens, axis=0), [batch_size,1,1])),            # [N, hp.num_gst, 256/hp.num_heads]   
            num_heads=hp.num_heads,
            num_units=hp.style_att_dim,
            attention_type=hp.style_att_type)

          style_embeddings = style_attention.multi_head_attention()                   # [N, 1, 256]
        else:
          style_embeddings = tf.expand_dims(refnet_outputs, axis=1)                   # [N, 1, 128]
      else:
        print("Use random weight for GST.")
        random_weights = tf.random_uniform([hp.num_heads, hp.num_gst], maxval=1.0, dtype=tf.float32)
        random_weights = tf.nn.softmax(random_weights, name="random_weights")
        style_embeddings = tf.matmul(random_weights, tf.nn.tanh(gst_tokens))
        style_embeddings = tf.reshape(style_embeddings, [1, 1] + [hp.num_heads * gst_tokens.get_shape().as_list()[1]])

The original Tacotron1 shouldn't train with the reference encoder part, right? However, your code passes the non-GST-mode data into a reference_encoder, which seems strange. Maybe we can swap the two if conditions to make it correct, as sketched below:

    if hp.use_gst:
      ***
      if reference_mel is not None:
        ***
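
To make the proposal concrete, here is one way the restructured logic could look. This is only a sketch of my suggestion, not the repo's code: build_reference_embedding and build_style_attention are hypothetical wrappers around the reference_encoder and MultiheadAttention calls quoted above.

    import tensorflow as tf

    # Sketch of the proposed restructuring (hypothetical, not the repo's code).
    # build_reference_embedding / build_style_attention stand in for the
    # reference_encoder and MultiheadAttention calls quoted above.
    def compute_style_embeddings(hp, reference_mel, gst_tokens):
      if hp.use_gst:
        if reference_mel is not None:
          # GST mode with a reference: attend over the style tokens.
          refnet_outputs = build_reference_embedding(reference_mel, hp)  # [N, 128]
          return build_style_attention(refnet_outputs, gst_tokens, hp)  # [N, 1, 256]
        # GST mode without a reference: random softmax weights over the tokens.
        random_weights = tf.nn.softmax(
          tf.random_uniform([hp.num_heads, hp.num_gst], maxval=1.0))
        style_embeddings = tf.matmul(random_weights, tf.nn.tanh(gst_tokens))
        return tf.reshape(
          style_embeddings,
          [1, 1, hp.num_heads * gst_tokens.get_shape().as_list()[1]])
      # use_gst=False: behave like plain Tacotron1 and skip the
      # reference encoder entirely.
      return None

This way, use_gst=False would never touch the reference encoder, which matches what I expect from Tacotron1.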

Thanks!

syang1993 commented 6 years ago

@dazenhom Hi, thanks for your notes. In this repo, use_gst=False doesn't mean the Tacotron1 model. Google also has another paper, which uses a reference encoder to do style and multi-speaker synthesis. You can find it at https://arxiv.org/abs/1803.09047.
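
For context, in both setups the reference/style embedding conditions synthesis in a similar way: it is broadcast along the text time axis and combined (e.g. concatenated) with the encoder outputs. A minimal sketch of that step, with shapes assumed from the comments in the excerpt above (an illustration, not the exact code in this repo):

    import tensorflow as tf

    # Hypothetical illustration: injecting a [N, 1, D] style/prosody
    # embedding into [N, T_in, E] text-encoder outputs by tiling it
    # across time and concatenating along the channel axis.
    def condition_encoder_outputs(encoder_outputs, style_embeddings):
      t_in = tf.shape(encoder_outputs)[1]
      tiled = tf.tile(style_embeddings, [1, t_in, 1])      # [N, T_in, D]
      return tf.concat([encoder_outputs, tiled], axis=-1)  # [N, T_in, E+D]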

dazenhom commented 6 years ago

@syang1993 Thanks for your reply; I mistook your work for Tacotron1. I shall find another Tacotron1 implementation to run my test. Thanks anyway.

hyzhan commented 6 years ago

I have tried use_gst=False, but it seems to be the same as Tacotron1? Although refnet_outputs changes, the generated audio hardly changes with different reference audio.

dazenhom commented 6 years ago

@hyzhan In my experience, it may be because of your data. If you use expressive speakers as your training data and then do inference, the speech can be different (changing with the reference audio). Otherwise, it can show little difference, as you mentioned.