keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

Multi-speaker Tacotron #18

Open candlewill opened 7 years ago

candlewill commented 7 years ago

According to Baidu's Deep Voice 2 paper, it is possible to modify the original Tacotron architecture into a multi-speaker version. The main architecture modification is:

image

Based on @keithito's implementation and my understanding of the paper, I tried to reproduce it. But so far, I cannot get good results.

Does anyone have any advice? Has anyone else tried, or does anyone plan, to implement the multi-speaker version?

fanskyer commented 7 years ago

I implemented a simple version that enables speaker embedding. So far it can distinguish male from female voices, but the quality is bad, maybe because it is still in the early training stage.

ElevenGameStudios commented 7 years ago

Compared to single-speaker, I think it is also necessary to increase the size of certain parts of the net to better "soak up" the extra speaker information. But since the Deep Voice authors suggested embedding the speaker ID at a lot of sites, I cannot tell which of those parts would be the most important. My imagined ideal implementation would be one where the speaker embedding vector is exposed so that the resulting voice can be customized (e.g. using 90% of speaker A mixed with 10% of speaker B). Did you guys implement it just like the paper suggested? I'd love to hear some more details and/or results.
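
The mixing idea above can be sketched as a linear interpolation between two learned speaker embedding vectors. This is only an illustration: the function name `mix_speakers`, the embedding size of 4, and the values are made up, and nothing in this thread confirms that interpolated embeddings actually produce a natural blended voice.

```python
def mix_speakers(emb_a, emb_b, weight_a):
    """Blend two speaker embeddings: weight_a * emb_a + (1 - weight_a) * emb_b."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same size")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be in [0, 1]")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(emb_a, emb_b)]


# Hypothetical 4-dimensional embeddings for two speakers (made-up values).
speaker_a = [1.0, 0.0, 2.0, -1.0]
speaker_b = [0.0, 1.0, 0.0, 1.0]

# 90% of speaker A mixed with 10% of speaker B.
mixed = mix_speakers(speaker_a, speaker_b, 0.9)
```

At synthesis time the mixed vector would be fed to the model wherever a single speaker's embedding is normally looked up, assuming the model exposes that input.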

candlewill commented 7 years ago

I have tried many ways to put the speaker embedding into Tacotron to synthesize multiple speakers' voices.

However, the model can synthesize at most two people's voices. With more than two speakers, the alignment is not learned correctly.

Here are some samples generated at eval step: Person 1: http://pan.baidu.com/s/1skOYz97 Person 2: http://pan.baidu.com/s/1o8KBc1O

More could be found here: http://t.cn/R9k6GNi

candlewill commented 7 years ago

Trained on the VCTK dataset, the model can synthesize more speakers' voices. Here is a subset of the synthesized samples: http://t.cn/RC21sqM

lifeiteng commented 7 years ago

@candlewill In Deep Voice 2, they also mentioned that the alignment failed to be learned for some speakers.

DarkDefender commented 7 years ago

@candlewill Sounds like you have made some nice progress. Did you change anything from your previous post? (Because it now seems to work with more than two people.)

Any plans to get your changes merged upstream?

candlewill commented 7 years ago

@DarkDefender I changed how the speaker identity is merged in, based on this paper: Listening while Speaking: Speech Chain by Deep Learning.

image

The speaker embedding is only concatenated with the attention RNN input and output, not used anywhere else.
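
A rough sketch of that wiring, since the actual code was not posted here: the same speaker embedding is appended to every attention RNN input and again to every attention RNN output. The helper names, the toy sizes, and the zero-output stand-in for the RNN are all hypothetical; in the TensorFlow model this would presumably be a tensor concatenation along the feature axis.

```python
def concat_speaker(frames, speaker_embedding):
    """Append the same speaker embedding to every time step of a sequence."""
    return [list(frame) + list(speaker_embedding) for frame in frames]


def toy_attention_rnn(inputs, output_size):
    """Stand-in for the attention RNN: emits a zero vector per time step."""
    return [[0.0] * output_size for _ in inputs]


# Hypothetical sizes: 3 decoder steps, 2-dim prenet output, 2-dim speaker embedding.
prenet_out = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
speaker_embedding = [9.0, 8.0]

rnn_in = concat_speaker(prenet_out, speaker_embedding)   # input side: 2 + 2 features
rnn_out = toy_attention_rnn(rnn_in, output_size=3)
rnn_out = concat_speaker(rnn_out, speaker_embedding)     # output side: 3 + 2 features
```

Downstream layers then see the widened output, so their input dimensions have to grow by the embedding size.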

PooriaKhani commented 6 years ago

@candlewill Could you please share your code? What does the alignment plot look like? How should I change the code?

hutauf commented 6 years ago

I would also like to get that code.

Would it be possible to use a pretrained model (for example LJSpeech), then use a different dataset (with a different speaker) to learn the new voice? Would we need less data then ("one shot learning")? Maybe make a lot of the network untrainable and only train the decoder part?
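
One way the "train only the decoder" idea could be expressed is by filtering variables by name before handing them to the optimizer. This is a sketch only: the variable names and the `select_trainable` helper are hypothetical, and whether this reduces the amount of data needed is not answered in this thread. In TensorFlow 1.x, a list like this could be passed as `var_list` to an optimizer's `minimize` call.

```python
def select_trainable(variable_names, trainable_prefixes):
    """Keep only variables whose name starts with one of the given prefixes."""
    return [name for name in variable_names
            if any(name.startswith(p) for p in trainable_prefixes)]


# Hypothetical variable names; a real checkpoint will use different scopes.
all_variables = [
    "embedding/table",
    "encoder/cbhg/conv_bank",
    "decoder/attention_rnn/kernel",
    "decoder/output_projection/kernel",
]

# Train only the decoder; everything else keeps its pretrained weights.
trainable = select_trainable(all_variables, ["decoder/"])
frozen = [name for name in all_variables if name not in trainable]
```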

carpedm20 commented 6 years ago

I changed this code to reproduce Deep Voice 2 and Listening while Speaking, and got good results for 4 Korean public figures (current and former presidents, one anchor and one voice actor).

Code : https://github.com/carpedm20/multi-speaker-tacotron-tensorflow Audio samples : http://carpedm20.github.io/tacotron

Even though two of the figures have only 2 and 5 hours, respectively, of naively aligned text-speech pairs (90% are correctly aligned, but 10% have a missing word or the wrong sentence), the results are pretty interesting. Thanks @keithito for your great work!

DarkDefender commented 6 years ago

@carpedm20 Did something unexpected happen? It seems like the github repo you linked has vanished...

carpedm20 commented 6 years ago

@DarkDefender The company decided to close the code because releasing it raised worries about potentially offensive uses in Korean society.

Fix: updated the link.

DarkDefender commented 6 years ago

@carpedm20 Ah, I see... Thanks for the explanation. I hope that some day it will be open again. The quality seemed to be quite nice.

buma commented 6 years ago

It seems the repository is back up.