Hello! Thanks for the great work!
I actively use VITS in various personal mini-projects, and I had an idea about adding new speakers to the multi-speaker model.
The essence of my idea is this:
I trained a good multi-speaker model on 200 speakers.
Using SpeakerNet, I obtained an embedding of a suitable format for a new speaker.
I want to add this new speaker to the existing multi-speaker model by appending the new embedding, so that emb_g.shape goes from [200, 192] to [201, 192]. I append the new embedding inside the utils.load_checkpoint function.
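For reference, the appending step looks roughly like this. It is a minimal sketch: the key name `emb_g.weight` assumes the standard VITS SynthesizerTrn parameter naming, and the toy state dict stands in for a real checkpoint loaded with torch.load.

```python
import torch

def append_speaker_embedding(state_dict, new_emb, key="emb_g.weight"):
    """Append one row to the speaker-embedding table in a checkpoint state dict.

    `key` assumes the standard VITS parameter name for the speaker
    embedding; adjust it if your checkpoint names it differently.
    """
    old = state_dict[key]  # e.g. shape [200, 192]
    state_dict[key] = torch.cat([old, new_emb.reshape(1, -1)], dim=0)
    return state_dict

# Toy stand-in for a real checkpoint's state dict
state = {"emb_g.weight": torch.zeros(200, 192)}
new_spk = torch.randn(192)  # stand-in for a SpeakerNet embedding
state = append_speaker_embedding(state, new_spk)
print(state["emb_g.weight"].shape)  # torch.Size([201, 192])
```

After this, the expanded state dict is passed to the model (with n_speakers increased to 201) as usual.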
The model loads without problems. However, at inference time, instead of the expected new (!) voice, I get one of the 200 already-trained voices. Moreover, if I feed in some other embedding, I get a different voice from those 200. So I conclude that the model can potentially generate voices for artificially added speakers, but I cannot get the voice to match the target.
Could you please tell me how to solve this problem? Why does the model generate a different voice when it sees a new embedding?