Closed: maozhiqiang closed this issue 5 years ago.
@maozhiqiang Thanks for your interest in our work.
The recipes provided in the examples
directory are based on Tacotron rather than Tacotron2. You have to modify the configuration to test Tacotron2.
The following is a Tacotron2 setting matching the original paper. It uses LJSpeech as the dataset. You can use the --hparams
option to override the configuration from the provided recipe.
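For intuition, here is a minimal sketch of that override behavior: a JSON file supplies the base configuration, and the comma-separated --hparams string overrides individual keys. This is an illustrative stand-in written from scratch, not the project's actual HParams loader, and the naive comma split shown here would not handle list-valued overrides such as decoder_prenet_out_units=[256,256].

```python
import json

def apply_hparam_overrides(json_text, overrides):
    """Merge comma-separated key=value overrides into a base JSON config.

    Illustrative only: a real HParams parser also handles list values
    (whose commas would break this simple split) and typed coercion.
    """
    config = json.loads(json_text)
    for pair in overrides.split(","):
        key, _, raw = pair.partition("=")
        try:
            # json.loads covers numbers, booleans, and quoted strings
            value = json.loads(raw)
        except json.JSONDecodeError:
            value = raw  # fall back to treating the value as a plain string
        config[key] = value
    return config

base = '{"embedding_dim": 256, "initial_learning_rate": 0.001}'
cfg = apply_hparam_overrides(base, "embedding_dim=512,initial_learning_rate=0.0005")
```

So keys given on the command line win over the JSON file, and untouched keys keep their file values.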
bazel run train -- --source-data-root=/path/to/source --target-data-root=/path/to/target --checkpoint-dir=/path/to/checkpoint --selected-list-dir=self-attention-tacotron/examples/ljspeech --hparam-json-file=/Users/yasuda/project/self-attention-tacotron/examples/ljspeech/tacotron.json --hparams=tacotron_model="ExtendedTacotronV1Model",encoder="EncoderV2",decoder="ExtendedDecoder",use_postnet_v2=True,embedding_dim=512,initial_learning_rate=0.0005,decoder_prenet_out_units=[256,256],decoder_out_units=1024,attention_out_units=128,outputs_per_step=1,n_feed_frame=1,max_iters=1000,attention=location_sensitive,attention_kernel=31,attention_filters=32,use_zoneout_at_encoder=True,decoder_version="v2",dataset="ljspeech.dataset.DatasetSource",save_checkpoints_steps=379,num_symbols=256,use_l2_regularization=True,l2_regularization_weight=1e-6
You can replace the Tacotron2 decoder with TransformerDecoder via the decoder
option. You can use the following command to test Tacotron2 with a self-attention decoder.
bazel run train -- --source-data-root=/path/to/source --target-data-root=/path/to/target --checkpoint-dir=/path/to/checkpoint --selected-list-dir=self-attention-tacotron/examples/ljspeech --hparam-json-file=/Users/yasuda/project/self-attention-tacotron/examples/ljspeech/tacotron.json --hparams=tacotron_model="ExtendedTacotronV1Model",encoder="EncoderV2",decoder="TransformerDecoder",use_postnet_v2=True,embedding_dim=512,initial_learning_rate=0.0005,decoder_prenet_out_units=[256,256],decoder_out_units=1024,attention_out_units=128,outputs_per_step=1,n_feed_frame=1,max_iters=1000,attention=location_sensitive,attention_kernel=31,attention_filters=32,use_zoneout_at_encoder=True,decoder_version="v2",dataset="ljspeech.dataset.DatasetSource",save_checkpoints_steps=379,num_symbols=256,use_l2_regularization=True,l2_regularization_weight=1e-6,decoder_self_attention_out_units=1024,batch_size=16
Note that I added two extra configurations in this example command. decoder_self_attention_out_units=1024
is required because its dimension must match decoder_out_units=1024
due to the residual connection.
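The dimension constraint comes from the element-wise addition in a residual connection: the sub-layer output is added back to its input, so both must share the same last dimension. A minimal shapes-only sketch in plain Python (an illustration of the constraint, not the project's code):

```python
def residual_add(x, sublayer_out):
    """Residual connection: output = x + sublayer(x).

    Element-wise addition only works if both operands have the same
    dimensionality, which is why decoder_self_attention_out_units must
    equal decoder_out_units (1024 in the command above).
    """
    if len(x) != len(sublayer_out):
        raise ValueError("residual add requires matching dimensions: "
                         f"{len(x)} vs {len(sublayer_out)}")
    return [a + b for a, b in zip(x, sublayer_out)]

decoder_out = [0.1] * 1024   # decoder_out_units=1024
attn_out = [0.2] * 1024      # decoder_self_attention_out_units=1024
y = residual_add(decoder_out, attn_out)
```

If the two units were set to different values, the addition would be undefined and the model could not be built.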
The smaller batch size, batch_size=16
, is required because self-attention performs a large matrix multiplication in the decoder, which causes OOM when the reduction factor is 1. You can also mitigate the OOM issue by using a larger reduction factor with outputs_per_step=2
, but at the cost of audio quality. A similar workaround is described in https://arxiv.org/abs/1809.08895 .
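To see why the reduction factor helps: a self-attention decoder forms a T×T score matrix over its decoder steps, and outputs_per_step=r divides T by r, so the attention matrix shrinks roughly by r². A rough back-of-the-envelope sketch (the frame count and frame rate below are illustrative assumptions, not measured values from the model):

```python
def attention_matrix_entries(n_frames, outputs_per_step, n_heads=1):
    """Entries in a decoder self-attention score matrix for one example.

    With reduction factor r, the decoder runs n_frames // r steps,
    so the T x T score matrix shrinks by roughly a factor of r**2.
    """
    steps = n_frames // outputs_per_step
    return n_heads * steps * steps

frames = 1000  # hypothetical utterance length in mel frames
r1 = attention_matrix_entries(frames, outputs_per_step=1)  # 1,000,000 entries
r2 = attention_matrix_entries(frames, outputs_per_step=2)  # 250,000 entries
```

Doubling the reduction factor quarters the score-matrix memory per head per example, which is why outputs_per_step=2 can avoid OOM where outputs_per_step=1 cannot, at the cost of coarser (lower-quality) audio generation.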
@TanUkkii007, thanks for your detailed explanation! I will try it, thanks again!
Hi, thank you for your hard work. I am very interested in self-attention. How can I combine self-attention with the Tacotron2 project, and how should I modify the network structure?