Sure, you just need to add a speaker id to your dataset (probably 0 for the first speaker or 28 for the last), then fine-tune from the multi-speaker model.
Thank you for the response.
I understand how to add a new voice by assigning a speaker ID and fine-tuning. However, my concern is preserving the existing voices in a multi-speaker model without overlap.
Also, I agree with the approach in #333, where overwriting an existing speaker may be faster or more efficient than fine-tuning for a single new voice.
Is there any alternative to fine-tuning and overwriting a voice, or is this the only option?
Thank you for your help!
> Is there any alternative to fine-tuning and overwriting a voice, or is this the only option?
Unfortunately, no, unless there were extra speaker ids added to the original model that weren't used. The size of the model layers depends on the number of speakers, so you can't simply extend it (at least I don't know how).
It is actually easy to extend: you load the speaker embedding matrix, add an extra row, then save the model back, like this:
```python
import torch
from torch import nn

import utils                      # VITS training utilities
from models import SynthesizerTrn

# Load the existing multi-speaker checkpoint.
checkpoint = torch.load("G.pth", map_location="cpu")
net_g = SynthesizerTrn()          # construct with the same hparams as the original model
utils.load_checkpoint("G.pth", net_g)

with torch.no_grad():
    # New speaker embedding table: 10 speakers, gin_channels = 256.
    new_embedding = nn.Embedding(10, 256)
    nn.init.normal_(new_embedding.weight, 0.0, 256 ** -0.5)
    # Copy over the rows of speakers that already exist in the old model
    # (id2spk maps new ids to speaker names, spk2idold maps names to old ids).
    for i in range(10):
        lab = id2spk.get(i)
        if lab in spk2idold:
            new_embedding.weight[i] = net_g.emb_g.weight[spk2idold.get(lab)]

net_g.emb_g = new_embedding
net_g.n_speakers = 10

state_dict = net_g.state_dict()
torch.save({'model': state_dict,
            'iteration': checkpoint['iteration'],
            'optimizer': checkpoint['optimizer'],
            'learning_rate': checkpoint['learning_rate']}, "G.pth")
```
Nice, thanks @nshmyrev!
Also make sure to update the JSON config.
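For reference, a minimal sketch of that config update, assuming the upstream VITS layout where the speaker count lives under `data.n_speakers` (some forks call it `num_speakers` instead):

```python
import json

# Bump the speaker count so the config matches the new embedding table.
# The key name here is an assumption; adjust it to whatever your config uses.
with open("config.json", encoding="utf-8") as f:
    config = json.load(f)

config["data"]["n_speakers"] = 10

with open("config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
```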
```python
import torch
from torch import nn
from vits.lightning import VitsModel

# Load the 29-speaker checkpoint.
model = VitsModel.load_from_checkpoint(
    "../model/en_final/epoch=999-step=354000.ckpt", dataset=None)
model_g = model.model_g

with torch.no_grad():
    # New speaker embedding table: 30 speakers, gin_channels = 512.
    new_embedding = nn.Embedding(30, 512)
    nn.init.normal_(new_embedding.weight, 0.0, 512 ** -0.5)
    # Keep the 29 existing speakers (ids 0-28)...
    for i in range(29):
        new_embedding.weight[i] = model_g.emb_g.weight[i]
    # ...and put the new voice in the last row (id 29).
    # new_speaker_embedding is an nn.Embedding(1, 512) prepared for the new speaker.
    new_embedding.weight[29] = new_speaker_embedding.weight[0]

model_g.emb_g = new_embedding
model_g.n_speakers = 30

state_dict = model_g.state_dict()
```
I added new speaker embeddings and performed inference as shown in the code above, but the voice sounded similar to that of speaker 0.
> but the voice sounded similar to that of speaker 0.
Right, now you need to fine-tune the model on the new speaker's data, either just the embedding or the full model.
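For the embedding-only variant, a minimal sketch (reusing the `model_g` object from the code above) is to freeze everything except the speaker embedding table before resuming training:

```python
# Freeze all weights, then re-enable gradients only for the speaker
# embedding table, so fine-tuning updates just emb_g.
for param in model_g.parameters():
    param.requires_grad = False
for param in model_g.emb_g.parameters():
    param.requires_grad = True
```

Whether the embedding alone is enough, or a full fine-tune is needed, depends on how different the new voice is from the existing speakers.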
> Right, now you need to fine-tune the model on the new speaker's data, either just the embedding or the full model.
Ah, right, I get it now. That is how to extend the speakers. So the only way to change a voice is to update the training dataset and fine-tune the model.
Thanks for the help!
Hi!
I am currently working with an English multi-speaker model that includes 29 speakers, all of which are female voices. I would like to add a male voice to this model while keeping the remaining 28 female voices for future use.
I have already prepared a dataset for a single male voice.
Could anyone please guide me on how to overwrite one of the speaker embeddings with this new male voice?
Thank you for your assistance in advance!