Training Multi-Speaker Model.

syang1993 / gst-tacotron

A TensorFlow implementation of the paper "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis"


Training Multi-Speaker Model. #21

sujithpadar closed this issue 5 years ago

sujithpadar commented 5 years ago

Hi, I have successfully trained this model on single-speaker data, but I am not sure how to extend it to a multi-speaker setting. Is it just a matter of changing the data, or does the model need to change as well? Has anyone tried this before?

syang1993 commented 5 years ago

@sujithpadar I modified this code to train a multi-speaker model with 4 speakers, and it works. To train a multi-speaker model, you need to add an extra speaker embedding and feed the speaker information into the model. See also the paper "Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron".

sujithpadar commented 5 years ago

@syang1993 Thanks a lot for the quick response, I really appreciate it. From the paper, I gather that 1) I need to concatenate the speaker embedding with the encoder output, with no change to the cost function, and 2) I need to feed the speaker identifier to the model as well.

I found such an implementation here: https://github.com/keithito/tacotron/tree/multispeaker

Can I use this as a reference and make the necessary modifications? Sorry to bug you, I'm new to deep learning. Or better, if you have your implementation handy, could you share it?

Thanks!!!
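
For readers following along, point 2 above (feeding a speaker identifier to the model) usually starts on the data side: each speaker name is mapped to an integer ID that travels with the utterance. The sketch below illustrates that idea only. The `wav_name|speaker|text` metadata layout and the helper names `build_speaker_table` and `attach_speaker_ids` are hypothetical and are not taken from this repository.

```python
# Hypothetical metadata rows: "<wav_name>|<speaker_name>|<text>".
# This layout is an assumption for illustration, not the repo's format.
metadata = [
    "a_0001|alice|Hello there.",
    "b_0001|bob|Good morning.",
    "a_0002|alice|See you soon.",
]


def build_speaker_table(rows):
    """Map each distinct speaker name to a stable integer ID."""
    names = sorted({row.split("|")[1] for row in rows})
    return {name: idx for idx, name in enumerate(names)}


def attach_speaker_ids(rows, table):
    """Return (wav_name, speaker_id, text) tuples for a data feeder."""
    examples = []
    for row in rows:
        # Split at most twice so a "|" inside the text is preserved.
        wav_name, speaker, text = row.split("|", 2)
        examples.append((wav_name, table[speaker], text))
    return examples


speaker_table = build_speaker_table(metadata)
examples = attach_speaker_ids(metadata, speaker_table)
```

Sorting the names before numbering keeps the IDs stable across runs, which matters because the embedding rows learned during training are tied to those IDs.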

syang1993 commented 5 years ago

@sujithpadar Hi, you are right. You only need to add a speaker embedding and concatenate it with the style embedding. So you only need to modify dataset.py and tacotron.py, plus make small changes in train.py.

I'm sorry, but I'm currently interning at a company, so I cannot upload any code from my work.
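
For readers following along, the conditioning step described in this thread can be sketched at the shape level. This is a minimal, framework-agnostic illustration in NumPy, not the author's code and not this repository's TensorFlow graph. All dimensions, the random stand-in tensors, and the function name `condition_encoder` are assumptions. It also assumes the combined speaker and style vector is repeated over time and appended to the encoder outputs, which is one common way to apply such an embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions.
batch, time_steps = 2, 5
encoder_dim, style_dim, speaker_dim = 8, 4, 3
num_speakers = 4

# Random stand-ins for tensors a real model would compute.
encoder_outputs = rng.normal(size=(batch, time_steps, encoder_dim))
style_embedding = rng.normal(size=(batch, 1, style_dim))

# A trainable lookup table in a real model; random values here.
speaker_table = rng.normal(size=(num_speakers, speaker_dim))
speaker_ids = np.array([0, 3])  # one integer ID per utterance


def condition_encoder(encoder_outputs, style_embedding, speaker_table, speaker_ids):
    # Look up one vector per utterance: [batch, 1, speaker_dim].
    speaker_embedding = speaker_table[speaker_ids][:, None, :]
    # Join style and speaker vectors: [batch, 1, style_dim + speaker_dim].
    cond = np.concatenate([style_embedding, speaker_embedding], axis=-1)
    # Repeat along time so every encoder frame sees the same vector.
    cond = np.tile(cond, (1, encoder_outputs.shape[1], 1))
    # Append to the encoder outputs along the feature axis.
    return np.concatenate([encoder_outputs, cond], axis=-1)


conditioned = condition_encoder(
    encoder_outputs, style_embedding, speaker_table, speaker_ids
)
print(conditioned.shape)  # (2, 5, 15)
```

Because the speaker vector only widens the decoder's input features, the training loss itself stays unchanged, which matches the observation earlier in the thread.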

sujithpadar commented 5 years ago

@syang1993 Thanks a lot, I think I'll be able to manage that.

freds0 commented 5 years ago

@sujithpadar, did you succeed in developing a multi-speaker version? If so, can you share it?