nii-yamagishilab / self-attention-tacotron

An implementation of "Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language" https://arxiv.org/abs/1810.11960
BSD 3-Clause "New" or "Revised" License

what's the input representation? #26

Open pandaGst opened 5 years ago

pandaGst commented 5 years ago

Hello, nice work! May I ask what the input to this Tacotron is: linguistic features or a character sequence?

TanUkkii007 commented 5 years ago

Hello, @pandaGst. It depends on the dataset you use. For the VCTK and LJSpeech datasets provided as examples in this repository, the input is a character sequence. In our paper, for a Japanese corpus, the input is linguistic features: a phoneme sequence and accentual type labels (https://arxiv.org/abs/1810.11960).
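
For concreteness, here is a minimal sketch of the two input styles described above. The symbol inventories and function names are illustrative assumptions, not the repository's actual preprocessing code:

```python
# Illustrative only: the symbol tables and function names here are
# assumptions, not this repository's actual preprocessing code.

# English (VCTK/LJSpeech): the raw character sequence mapped to integer IDs.
CHARACTERS = " abcdefghijklmnopqrstuvwxyz'.,?!"
CHAR_TO_ID = {c: i for i, c in enumerate(CHARACTERS)}

def encode_characters(text):
    """Lowercase the text and map each known character to its ID."""
    return [CHAR_TO_ID[c] for c in text.lower() if c in CHAR_TO_ID]

# Japanese (paper setup): a phoneme sequence plus an aligned sequence of
# accentual type labels, each with its own ID space.
PHONEMES = ["pau", "a", "i", "u", "e", "o", "k", "n", "t", "w"]  # toy inventory
ACCENTS = ["L", "H"]  # toy low/high accent labels
PHONEME_TO_ID = {p: i for i, p in enumerate(PHONEMES)}
ACCENT_TO_ID = {a: i for i, a in enumerate(ACCENTS)}

def encode_japanese(phonemes, accents):
    """Map aligned phoneme and accent label sequences to IDs."""
    assert len(phonemes) == len(accents), "one accent label per phoneme"
    return ([PHONEME_TO_ID[p] for p in phonemes],
            [ACCENT_TO_ID[a] for a in accents])

print(encode_characters("Hello world"))
phoneme_ids, accent_ids = encode_japanese(
    ["k", "o", "n", "n", "i", "t", "i", "w", "a"],
    ["L", "L", "H", "H", "H", "H", "H", "H", "H"])
```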

The main network implementation is not tied to a specific input format, so if you prepare your own dataset, I think you can plug it in without much work.
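
One common pattern for plugging in a custom dataset is to expose it as a `tf.data` pipeline yielding (input IDs, mel spectrogram) pairs. This is a generic sketch, not the data interface this repository actually uses; the record layout, shapes, and names are assumptions:

```python
# Generic sketch of feeding a custom dataset; the record layout, shapes,
# and names are assumptions, not this repository's actual data interface.
import numpy as np
import tensorflow as tf

def utterance_generator():
    # Replace this with real loading of (input IDs, mel spectrogram) pairs.
    for _ in range(100):
        input_ids = np.random.randint(0, 40, size=(50,), dtype=np.int32)
        mel = np.random.randn(200, 80).astype(np.float32)  # frames x mel bins
        yield input_ids, mel

dataset = tf.data.Dataset.from_generator(
    utterance_generator,
    output_types=(tf.int32, tf.float32),
    output_shapes=(tf.TensorShape([None]), tf.TensorShape([None, 80])),
)
# Pad variable-length utterances so they can be batched together.
dataset = dataset.padded_batch(
    8, padded_shapes=(tf.TensorShape([None]), tf.TensorShape([None, 80])))
```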

hyzhan commented 4 years ago

How do you obtain these linguistic features, phoneme sequences, and accentual type labels?

TanUkkii007 commented 3 years ago

In our paper, we used manual annotation.