nii-yamagishilab / self-attention-tacotron

An implementation of "Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language" https://arxiv.org/abs/1810.11960
BSD 3-Clause "New" or "Revised" License

self_attention #2

Closed maozhiqiang closed 5 years ago

maozhiqiang commented 5 years ago

Hi, thank you for your hard work. I am very interested in self_attention. How can I combine self_attention with the Tacotron2 project? How should I modify the network structure?

TanUkkii007 commented 5 years ago

@maozhiqiang Thanks for your interest in our work.

The provided recipes in the examples directories are based on Tacotron rather than Tacotron2. You have to modify the configuration to test Tacotron2.

The following command uses the same Tacotron2 settings as the original paper, with LJSpeech as the dataset. You can use the --hparams option to override the configuration from the provided recipe.

bazel run train -- --source-data-root=/path/to/source   --target-data-root=/path/to/target   --checkpoint-dir=/path/to/checkpoint --selected-list-dir=self-attention-tacotron/examples/ljspeech --hparam-json-file=/Users/yasuda/project/self-attention-tacotron/examples/ljspeech/tacotron.json --hparams=tacotron_model="ExtendedTacotronV1Model",encoder="EncoderV2",decoder="ExtendedDecoder",use_postnet_v2=True,embedding_dim=512,initial_learning_rate=0.0005,decoder_prenet_out_units=[256,256],decoder_out_units=1024,attention_out_units=128,outputs_per_step=1,n_feed_frame=1,max_iters=1000,attention=location_sensitive,attention_kernel=31,attention_filters=32,use_zoneout_at_encoder=True,decoder_version="v2",dataset="ljspeech.dataset.DatasetSource",save_checkpoints_steps=379,num_symbols=256,use_l2_regularization=True,l2_regularization_weight=1e-6
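Because the --hparams value is one long comma-separated list of name=value pairs, it can be easier to assemble it from a dictionary than to edit it by hand. The following is only a minimal sketch: `format_hparams` is a hypothetical helper, not part of this repository, and it assumes that values are plain numbers, booleans, unquoted strings, or flat lists like those in the command above.

```python
def format_hparams(overrides):
    """Join overrides into a comma-separated name=value string.

    Hypothetical helper for illustration; it only mirrors the textual
    form of the --hparams value shown in the command above.
    """
    parts = []
    for name, value in overrides.items():
        if isinstance(value, (list, tuple)):
            # Lists are written without spaces, e.g. [256,256].
            rendered = "[" + ",".join(str(v) for v in value) + "]"
        else:
            rendered = str(value)
        parts.append(f"{name}={rendered}")
    return ",".join(parts)


# A subset of the overrides from the command above.
overrides = {
    "embedding_dim": 512,
    "decoder_prenet_out_units": [256, 256],
    "decoder_out_units": 1024,
    "outputs_per_step": 1,
    "attention": "location_sensitive",
    "use_postnet_v2": True,
}
print("--hparams=" + format_hparams(overrides))
```

The printed string can then be pasted into the bazel command in place of the hand-written --hparams value.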

You can replace the Tacotron2 decoder with TransformerDecoder through the decoder option. Use the following command to test Tacotron2 with a self-attention decoder.

bazel run train -- --source-data-root=/path/to/source   --target-data-root=/path/to/target   --checkpoint-dir=/path/to/checkpoint --selected-list-dir=self-attention-tacotron/examples/ljspeech --hparam-json-file=/Users/yasuda/project/self-attention-tacotron/examples/ljspeech/tacotron.json --hparams=tacotron_model="ExtendedTacotronV1Model",encoder="EncoderV2",decoder="TransformerDecoder",use_postnet_v2=True,embedding_dim=512,initial_learning_rate=0.0005,decoder_prenet_out_units=[256,256],decoder_out_units=1024,attention_out_units=128,outputs_per_step=1,n_feed_frame=1,max_iters=1000,attention=location_sensitive,attention_kernel=31,attention_filters=32,use_zoneout_at_encoder=True,decoder_version="v2",dataset="ljspeech.dataset.DatasetSource",save_checkpoints_steps=379,num_symbols=256,use_l2_regularization=True,l2_regularization_weight=1e-6,decoder_self_attention_out_units=1024,batch_size=16

Note that I added two configurations to this example command. decoder_self_attention_out_units=1024 is required because its dimension must match decoder_out_units=1024 due to the residual connection. A smaller batch size, here batch_size=16, is required because self-attention performs a large matrix multiplication at the decoder, which results in OOM when the reduction factor is 1. You can also mitigate the OOM issue by using a larger reduction factor with outputs_per_step=2, but you have to sacrifice audio quality. A similar workaround can be found in https://arxiv.org/abs/1809.08895 .
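The two constraints above can be illustrated with some rough arithmetic. This is only a sketch under stated assumptions, not code from this repository: the function names are hypothetical, the utterance length of 1000 frames and the batch size of 32 are made-up values for comparison, and the estimate counts only the float32 self-attention score matrix of a single head in a single layer, ignoring all other activations and gradients.

```python
import math


def residual_dims_match(decoder_out_units, decoder_self_attention_out_units):
    """A residual connection adds two tensors elementwise, so widths must be equal."""
    return decoder_out_units == decoder_self_attention_out_units


def attention_score_bytes(batch_size, n_frames, outputs_per_step,
                          n_heads=1, bytes_per_value=4):
    """Rough size of the decoder self-attention score matrices.

    The decoder runs ceil(n_frames / outputs_per_step) steps, and the
    score matrix is steps x steps for each head and each batch item.
    """
    steps = math.ceil(n_frames / outputs_per_step)
    return batch_size * n_heads * steps * steps * bytes_per_value


# Settings from the command above: the widths agree.
print(residual_dims_match(1024, 1024))

# Hypothetical 1000-frame utterances, for comparison only.
for batch_size, outputs_per_step in [(32, 1), (16, 1), (16, 2)]:
    size = attention_score_bytes(batch_size, 1000, outputs_per_step)
    print(batch_size, outputs_per_step, size / 1e6, "MB")
```

Under these assumptions, halving the batch size halves the estimate, while doubling the reduction factor halves the number of decoder steps and so cuts the estimate to a quarter, because the score matrix grows with the square of the sequence length.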

maozhiqiang commented 5 years ago

@TanUkkii007, thanks for your detailed explanation! I will try it. Thanks again!