jaywalnut310 / vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
https://jaywalnut310.github.io/vits-demo/index.html
MIT License
6.48k stars 1.21k forks

Problems adding a new speaker #200

Open JoanisTriandafilidi opened 6 months ago

JoanisTriandafilidi commented 6 months ago

Hello! Thanks for the great work! I actively use VITS in various small personal projects, and I have an idea for adding new speakers to a multi-speaker model.

Here is what I did:

  1. I trained a good multi-speaker model on 200 speakers.
  2. I obtained an embedding of the right format (192 dimensions) for a new speaker using SpeakerNet.
  3. I want to add the new speaker to the existing multi-speaker model by appending its embedding to the speaker embedding table. That is, the shape of emb_g was [200, 192] and becomes [201, 192]. I append the new embedding inside the utils.load_checkpoint function.
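
For readers who want to reproduce step 3, here is a minimal sketch of appending one row to the speaker table in a loaded state dict. It is not the code from the post. The key name `emb_g.weight`, the helper `append_speaker_row`, and the random stand-in tensors are assumptions made for illustration; a real checkpoint and a real SpeakerNet embedding would take their place.

```python
import torch


def append_speaker_row(state_dict, new_embedding, key="emb_g.weight"):
    """Return a shallow copy of state_dict with one row appended to the speaker table.

    `key` is an assumed parameter name for the speaker embedding weights.
    """
    old = state_dict[key]  # expected shape: [n_speakers, gin_channels]
    new_row = new_embedding.reshape(1, -1).to(dtype=old.dtype, device=old.device)
    if new_row.shape[1] != old.shape[1]:
        raise ValueError(
            f"embedding width {new_row.shape[1]} does not match table width {old.shape[1]}"
        )
    patched = dict(state_dict)  # leave the original mapping untouched
    patched[key] = torch.cat([old, new_row], dim=0)
    return patched


# Stand-in data (assumption): a 200-speaker table and one external 192-dim embedding.
state_dict = {"emb_g.weight": torch.randn(200, 192)}
new_embedding = torch.randn(192)

patched = append_speaker_row(state_dict, new_embedding)
new_speaker_id = patched["emb_g.weight"].shape[0] - 1  # index of the appended row
print(tuple(patched["emb_g.weight"].shape), new_speaker_id)
```

With this approach the model would presumably have to be built with 201 speakers so that its embedding layer matches the patched table, and the new voice would be requested with speaker id 200.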

The model loads without problems. However, at inference, instead of the expected new (!) voice, I get one of the 200 already trained voices. Moreover, if I feed in some other embedding, I get some other voice from those 200. So I conclude that the model can, in principle, generate voices for artificially added speakers, but I can't get the generated voice to match the target.
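
One way to examine this observation is to check which trained rows the new embedding is closest to, and whether those ids match the voice that is heard. The sketch below is only a diagnostic and does not by itself explain the behavior. The helper `nearest_trained_speakers` and the random stand-in table are hypothetical, and the post does not say whether SpeakerNet embeddings and the learned emb_g rows share a vector space.

```python
import torch
import torch.nn.functional as F


def nearest_trained_speakers(table, query, top_k=5):
    """Rank rows of `table` by cosine similarity to `query`.

    Returns a list of (speaker_id, similarity) pairs, most similar first.
    """
    sims = F.cosine_similarity(table, query.reshape(1, -1), dim=1)  # shape: [n_speakers]
    k = min(top_k, table.shape[0])
    scores, ids = torch.topk(sims, k)
    return list(zip(ids.tolist(), scores.tolist()))


# Stand-in data (assumption): a random 200 x 192 table instead of trained weights.
torch.manual_seed(0)
table = torch.randn(200, 192)

# Sanity check: a rescaled copy of row 17 should rank speaker 17 first.
query = table[17] * 3.0
ranking = nearest_trained_speakers(table, query)
print(ranking[0])

# Norms are worth comparing too, since cosine similarity ignores scale.
print(float(query.norm()), float(table.norm(dim=1).mean()))
```

In practice `table` would be the first 200 rows of the trained speaker table and `query` the externally computed embedding.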

Could you please tell me how I can solve this problem? Why, when the model sees a new embedding, does it generate a different voice?