jaywalnut310 / glow-tts

A Generative Flow for Text-to-Speech via Monotonic Alignment Search
MIT License

Ideal size of gin_channels for multiple speaker embeddings? #14

Open · echelon opened this issue 4 years ago

echelon commented 4 years ago

Hi Jaehyeon, I modified your code to train multiple speakers and it seems to be training and inferring pretty well. Thanks for leaving the code in a state that makes this relatively easy!

Here are my hparams:

    "n_speakers": 10, 
    "gin_channels": 16

I have nine speakers, but mistakenly didn't zero-index them.

Is gin_channels too small? Should this be appreciably larger to capture the voice characteristics? 32? 64? ...?

Two of the speakers have four hours of data. Other speakers have far less. Oddly, the speaker with the smallest amount of data seems to have one of the clearest voices. Other speakers don't sound like their source at all.

I'm only at epoch 1400 so far, and I had to train from scratch, so this has a long way to go. Should I abandon this run and increase gin_channels, or does it seem reasonable to proceed?
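
For context, the core of my modification is roughly the speaker-embedding lookup below (an illustrative sketch rather than my exact diff; the class and attribute names are my own): the integer speaker ID indexes an embedding table of width gin_channels, and the resulting vector is the global conditioning input to the decoder.

    import torch
    import torch.nn as nn

    class SpeakerConditioning(nn.Module):
        """Maps an integer speaker ID to a gin_channels-dim vector used as
        the global conditioning input ("g") of the decoder."""

        def __init__(self, n_speakers=10, gin_channels=16):
            super().__init__()
            # one row per speaker token; row i acts as speaker i's voice vector
            self.emb_g = nn.Embedding(n_speakers, gin_channels)
            nn.init.uniform_(self.emb_g.weight, -0.1, 0.1)

        def forward(self, sid):
            # sid: LongTensor of speaker IDs, shape [batch]
            return self.emb_g(sid).unsqueeze(-1)  # [batch, gin_channels, 1]

    # e.g. the conditioning vector for speaker 3 under my current hparams
    g = SpeakerConditioning(n_speakers=10, gin_channels=16)(torch.tensor([3]))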

jaywalnut310 commented 4 years ago

@echelon Hi echelon. As I haven't tested on such small datasets, I can't give you a definite solution. Sorry about that. In my case, I didn't worry much about the dimension, so I just set gin_channels to be big enough; I trained my model on LibriTTS with a 256-dimensional gin_channels.

I think a large gin_channels does no harm in your case, either. I hope it's helpful for your case :)
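
In terms of the hparams above, that would just be something like:

    "n_speakers": 10,
    "gin_channels": 256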

echelon commented 4 years ago

Thanks so much for the feedback! 256 dimensions performs much better as far as I can tell.

It's perhaps a little premature to report my findings, but I've performed the following:

  • Trained a 64-speaker model (n_speakers = 64) with gin_channels = 256. All channels are trained and validated on LJS sample data distributed evenly across the speaker tokens [0, 64), with 10% withheld for validation, spread evenly across the same channels.
  • After training all 64 channels on LJS, substituted an arbitrary number of low-numbered speaker slots with novel datasets. (I'm currently training 10 voices.) The remaining speaker channels must continue to be trained on LJS or the model loses its fit. (A rough sketch of the filelist remapping this involves is below.)

I'll report back when I've had longer to let this run on my two 1080 Ti cards, but the early results already seem promising.
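
The substitution step itself is just bookkeeping over the training filelist. Here's a minimal sketch of the remapping I mean, assuming filelist lines of the form wav_path|speaker_id|text (the exact format depends on how you wire your dataset loader):

    import random

    def remap_filelist(ljs_lines, new_voices):
        """Replace the LJS entries on a few low-numbered speaker slots with new voices.

        ljs_lines:  original entries "wav_path|speaker_id|text", speaker IDs in [0, 64)
        new_voices: dict mapping a speaker slot (int) to a list of "wav_path|text"
                    entries for the novel voice that takes over that slot
        """
        out = []
        replaced = set(new_voices)
        for line in ljs_lines:
            path, sid, text = line.rstrip("\n").split("|", 2)
            if int(sid) not in replaced:            # untouched slots keep their LJS data
                out.append(f"{path}|{sid}|{text}")
        for slot, entries in new_voices.items():
            for entry in entries:                   # the new voice inherits the slot's ID
                path, text = entry.rstrip("\n").split("|", 1)
                out.append(f"{path}|{slot}|{text}")
        random.shuffle(out)                          # mix LJS and new-voice utterances
        return out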

marlon-br commented 4 years ago

Could you please make a small Google Colab notebook demonstrating how to add one more speaker, so that text can be converted to the new speaker's voice? Thanks!