rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/
MIT License

Overwrite speaker in multi speaker model #625

Closed: minho-lee424 closed this issue 1 month ago

minho-lee424 commented 1 month ago

Hi!

I am currently working with an English multi-speaker model that includes 29 speakers, all of which are female voices. I would like to add a male voice to this model while keeping the remaining 28 female voices for future use.

I have already prepared a dataset for a single male voice.

Could anyone please guide me on how to overwrite one of the speaker embeddings with this new male voice?

Thank you for your assistance in advance!

synesthesiam commented 1 month ago

Sure, you just need to add a speaker id to your dataset (probably 0 for the first speaker or 28 for the last), then fine-tune from the multi-speaker model.
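
For example (a sketch: Piper's multi-speaker metadata.csv uses id|speaker|text rows, and the paths and checkpoint names here are placeholders), the new dataset's rows would all reuse the speaker you want to replace. Whether the speaker label ends up mapped to the id you intend depends on how preprocessing assigns speaker ids, so double-check the generated config afterwards:

# metadata.csv for the new male voice, targeting the speaker to overwrite
male_0001|28|Hello, this is the new male voice.
male_0002|28|Here is another training sentence.

Then preprocess as usual and fine-tune with --resume_from_checkpoint pointing at the existing multi-speaker checkpoint, e.g.:

python3 -m piper_train \
    --dataset-dir /path/to/training_dir \
    --resume_from_checkpoint /path/to/multispeaker.ckpt \
    --batch-size 32 \
    --max_epochs 2000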

minho-lee424 commented 1 month ago

Thank you for the response.

I understand how to add a new voice by assigning a speaker ID and fine-tuning. However, my concern is preserving the existing voices in the multi-speaker model without the new voice interfering with them.

Also, I agree with the approach in #333, where overwriting an existing speaker may be faster or more efficient than fine-tuning for a single new voice.

Is there any alternative to fine-tuning and overwriting a voice, or is this the only option?

Thank you for your help!

synesthesiam commented 1 month ago

Is there any alternative to fine-tuning and overwriting a voice, or is this the only option?

Unfortunately, no, unless there were extra speaker IDs added to the original model that weren't used. The size of the model layers depends on the number of speakers, so you can't just extend it (at least, I don't know how).
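
You can see this by inspecting the checkpoint tensors, e.g. with a small sketch like this (the key name assumes the Lightning checkpoint layout used later in this thread; the file name is a placeholder):

import torch

ckpt = torch.load("model.ckpt", map_location="cpu")
state = ckpt["state_dict"]
# The speaker embedding table has shape [n_speakers, gin_channels],
# so the speaker count is baked into the saved tensor sizes.
print(state["model_g.emb_g.weight"].shape)  # e.g. torch.Size([29, 512])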

nshmyrev commented 1 month ago

It is actually easy to extend: you load the speaker embedding matrix, add the extra rows, then save the model back, like this:

import torch
from torch import nn

import utils  # VITS training utilities
from models import SynthesizerTrn

# Construct the model with the new speaker count (constructor
# hyperparameters omitted here) and load the old checkpoint into it.
net_g = SynthesizerTrn()
net_g, optimizer, learning_rate, iteration = utils.load_checkpoint("G.pth", net_g)

with torch.no_grad():
    # Fresh embedding table: 10 speakers, 256-dim, VITS-style init.
    new_embedding = nn.Embedding(10, 256)
    nn.init.normal_(new_embedding.weight, 0.0, 256 ** -0.5)
    # id2spk maps new speaker ids to labels; spk2idold maps labels back
    # to their old ids (both defined elsewhere). Copy over every speaker
    # that already existed so those voices are preserved unchanged.
    for i in range(10):
        lab = id2spk.get(i)
        if lab in spk2idold:
            new_embedding.weight[i] = net_g.emb_g.weight[spk2idold[lab]]
    net_g.emb_g = new_embedding
    net_g.n_speakers = 10

state_dict = net_g.state_dict()
torch.save({"model": state_dict,
            "iteration": iteration,
            "optimizer": optimizer,
            "learning_rate": learning_rate}, "G.pth")

synesthesiam commented 1 month ago

Nice, thanks @nshmyrev!

Also make sure to update the JSON config.
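
Something like this (a sketch: it assumes the voice's JSON config has the usual num_speakers and speaker_id_map fields, and the file name and speaker label are placeholders):

import json

with open("en_model.onnx.json", encoding="utf-8") as f:
    config = json.load(f)

# Bump the speaker count and map a label to the newly added row.
config["num_speakers"] = 30
config["speaker_id_map"]["new_male"] = 29

with open("en_model.onnx.json", "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)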

minho-lee424 commented 1 month ago

import torch
from torch import nn

from vits.lightning import VitsModel

# Load the existing 29-speaker checkpoint.
model = VitsModel.load_from_checkpoint(
    "../model/en_final/epoch=999-step=354000.ckpt", dataset=None
)
model_g = model.model_g

with torch.no_grad():
    # Fresh table: 30 speakers, 512-dim embeddings.
    new_embedding = nn.Embedding(30, 512)
    nn.init.normal_(new_embedding.weight, 0.0, 512 ** -0.5)
    # Copy the 29 existing speakers unchanged.
    for i in range(29):
        new_embedding.weight[i] = model_g.emb_g.weight[i]
    # Row 29: the new speaker's embedding (prepared elsewhere).
    new_embedding.weight[29] = new_speaker_embedding.weight[0]
    model_g.emb_g = new_embedding
    model_g.n_speakers = 30

state_dict = model_g.state_dict()

I added the new speaker embedding and performed inference as shown in the code above, but the voice sounded similar to that of speaker 0.

nshmyrev commented 1 month ago

but the voice sounded similar to that of speaker 0.

Right, now you need to fine-tune the model on the new speaker's data: either just the embedding, or the full model.
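
For the embedding-only route, a minimal sketch (assuming the VitsModel/model_g structure from the snippet above; row index 29 matches the new speaker) is to freeze everything and mask gradients so only the new row updates:

# Freeze the whole generator, then re-enable the embedding table.
for param in model_g.parameters():
    param.requires_grad = False
model_g.emb_g.weight.requires_grad = True

# Zero out gradients for every row except the new speaker's,
# so fine-tuning leaves the 29 original voices untouched.
def keep_only_new_row(grad):
    mask = torch.zeros_like(grad)
    mask[29] = 1.0
    return grad * mask

model_g.emb_g.weight.register_hook(keep_only_new_row)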

minho-lee424 commented 1 month ago

Right, now you need to fine-tune the model on the new speaker's data: either just the embedding, or the full model.

Ah, right. I got it. So that's how to extend the speakers. The only way to change an existing voice, then, is to update the training dataset and fine-tune the model.

Thanks for the help!