jaywalnut310 / glow-tts

A Generative Flow for Text-to-Speech via Monotonic Alignment Search
MIT License
667 stars, 150 forks

Add new speaker voice #13

Open marlon-br opened 4 years ago

marlon-br commented 4 years ago

Hi Jaehyeon,

Could you please provide instructions on how to use the pretrained model and add a new speaker voice?

I have created a Google Colab notebook based on your work (https://github.com/marlon-br/glow-tts-colab). Now I want to add support for more speaker voices.

echelon commented 4 years ago

Add these two hparams:

"n_speakers": 10,
"gin_channels": 16     

I'm not sure what the ideal value for gin_channels is to get a rich embedding, and I asked in another thread.
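As a rough illustration of the two hparams above, here is a minimal sketch that adds them to a config loaded as a Python dict. The helper name `add_speaker_hparams`, the `"model"` section, and the placeholder `hidden_channels` entry are assumptions made for this example; the thread does not say where in the config file the keys go.

```python
import copy
import json


def add_speaker_hparams(config, n_speakers, gin_channels, section="model"):
    """Return a copy of `config` with the two multi-speaker hparams set.

    `section` is an assumption: the thread does not say which part of the
    config file the keys belong to.
    """
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    updated.setdefault(section, {})
    updated[section]["n_speakers"] = n_speakers
    updated[section]["gin_channels"] = gin_channels
    return updated


# Placeholder single-speaker config; a real config file has many more keys.
base_config = {"model": {"hidden_channels": 192}}
multi_config = add_speaker_hparams(base_config, n_speakers=10, gin_channels=16)
print(json.dumps(multi_config, indent=2))
```

Returning a copy keeps the single-speaker config available for comparison while experimenting with different gin_channels values.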

Your training data and validation CSVs should be in this format:

filename|numeric_speaker_id|transcript
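To make the format concrete, here is a small sketch that parses one such line. The names `Utterance` and `parse_filelist_line` are hypothetical, and the check that speaker ids fall in the range 0 to n_speakers - 1 is an assumption (it is what an embedding lookup with n_speakers rows would need), not something stated in this thread.

```python
from typing import NamedTuple


class Utterance(NamedTuple):
    audio_path: str
    speaker_id: int
    text: str


def parse_filelist_line(line, n_speakers):
    """Parse one 'filename|numeric_speaker_id|transcript' line."""
    # maxsplit=2 keeps any '|' inside the transcript intact.
    parts = line.rstrip("\n").split("|", 2)
    if len(parts) != 3:
        raise ValueError(f"expected 3 '|'-separated fields, got {len(parts)}")
    audio_path, raw_speaker_id, text = parts
    speaker_id = int(raw_speaker_id)  # raises ValueError if not numeric
    # Assumption: ids index an embedding table with n_speakers rows.
    if not 0 <= speaker_id < n_speakers:
        raise ValueError(f"speaker id {speaker_id} outside 0..{n_speakers - 1}")
    return Utterance(audio_path, speaker_id, text)


example = parse_filelist_line("wavs/a.wav|3|Hello there.\n", n_speakers=10)
print(example)
```

Running a check like this over the whole training and validation files before training can catch malformed rows early.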

You'll need to swap out the loader:

-from data_utils import TextMelLoader, TextMelCollate 
+from data_utils import TextMelSpeakerLoader, TextMelSpeakerCollate       

You'll also need to change the forward function to accept the speaker id parameter g, and unpack the speaker ids from the batches you get when enumerating the loader.
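The shape of that change might look roughly like the sketch below. It uses plain Python lists and a stand-in function so it runs without PyTorch; `toy_forward` and `toy_loader` are hypothetical, and the assumption that the speaker-aware collate function yields the speaker ids as a fifth element of each batch is not confirmed in this thread.

```python
def toy_forward(x, x_lengths, y, y_lengths, g=None):
    """Stand-in for the model's forward function (not the real model).

    The only point illustrated is the extra keyword argument `g`, which
    carries one speaker id per item in the batch.
    """
    if g is None:
        raise ValueError("speaker ids (g) are required in multi-speaker mode")
    if len(g) != len(x):
        raise ValueError("need exactly one speaker id per batch item")
    return {"batch_size": len(x), "speaker_ids": list(g)}


# Stand-in for the data loader: each batch is assumed to be
# (text, text_lengths, mel, mel_lengths, speaker_ids).
toy_loader = [
    ([[1, 2, 3], [4, 5, 0]], [3, 2], [[0.1], [0.2]], [1, 1], [0, 7]),
    ([[6, 7, 8]], [3], [[0.3]], [1], [2]),
]

outputs = []
for batch_idx, (x, x_lengths, y, y_lengths, sid) in enumerate(toy_loader):
    # The single-speaker loop would unpack four values and omit `g`.
    outputs.append(toy_forward(x, x_lengths, y, y_lengths, g=sid))

print(outputs)
```

In real training code the batch elements would be tensors and the call would go to the model itself, but the two edits are the same: unpack one more value per batch and pass it on as g.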

marlon-br commented 4 years ago

I meant not retraining the whole model again, only adding one more voice.

dechubby commented 4 years ago

Sorry for jumping in. Could you please elaborate on the last part, about changing the forward function? Thanks in advance!

ppanja commented 3 years ago

Hi @echelon, this information is really useful. I believe I've made the necessary changes you suggested. In my case I've set n_speakers = 24 and gin_channels = 256, and the rest of the parameters in base.json are unchanged. The training set has 9102 samples. I'm getting the runtime error below.

RuntimeError: Given groups=1, weight of size 256 448 3, expected input[1, 192, 89] to have 448 channels, but got 192 channels instead

Can you please advise what is going wrong here?
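One possible reading of the shapes in this error, offered as an assumption rather than a confirmed diagnosis: 448 equals 192 + 256, so the failing layer appears to expect the 192 hidden channels concatenated with a 256-channel speaker embedding (the gin_channels value above), while the input it received carries only the 192 hidden channels. That would be consistent with the speaker ids not reaching the model. The helper `conditioned_channels` below is hypothetical and only spells out the arithmetic.

```python
def conditioned_channels(hidden_channels, gin_channels, speaker_given):
    """Channel count of a layer input under the concatenation assumption.

    Assumption (not confirmed in the thread): when speaker ids are passed,
    the speaker embedding is concatenated with the hidden features along
    the channel axis; otherwise only the hidden features are used.
    """
    return hidden_channels + gin_channels if speaker_given else hidden_channels


expected_by_layer = 448   # "expected input ... to have 448 channels"
received_by_layer = 192   # "but got 192 channels instead"
gin_channels = 256        # value reported in the comment above

with_speaker = conditioned_channels(received_by_layer, gin_channels, True)
without_speaker = conditioned_channels(received_by_layer, gin_channels, False)

print(with_speaker == expected_by_layer)     # layer was built for this case
print(without_speaker == received_by_layer)  # input matches this case
```

If this reading holds, a first thing to check would be whether the training loop actually unpacks the speaker ids from each batch and passes them to the forward call.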

ppanja commented 3 years ago

Hi @marlon-br, @dechubby, were you able to run in multi-speaker mode? Did you make any other changes apart from those echelon mentioned? I'm running into an issue that I'm not able to debug.

Any help will be really appreciated.

Regards, Prasanta