rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/

finetuning from previous checkpoints fails #108

Open meriamOu opened 1 year ago

meriamOu commented 1 year ago

Hello, thank you so much for the great work! I have tried to resume training using the previous checkpoints:

python3 -m piper_train \
    --dataset-dir /home/dataset/fr \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 32 \
    --validation-split 0.01 \
    --num-test-examples 5 \
    --max_epochs 10000 \
    --precision 32 \
    --quality x-low \
    --resume_from_checkpoint ryan/x-low/'epoch=6189-step=1733200.ckpt'

and with the ljspeech checkpoint:

python3 -m piper_train \
    --dataset-dir /home/fr \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 32 \
    --validation-split 0.01 \
    --num-test-examples 5 \
    --max_epochs 10000 \
    --precision 32 \
    --quality x-low \
    --resume_from_checkpoint /home/ljspeech/low/epoch=4429-step=1745812.ckpt

Both fail with a model-mismatch error:

RuntimeError: Error(s) in loading state_dict for VitsModel:
        Missing key(s) in state_dict: "model_g.dec.cond.weight", "model_g.dec.cond.bias", "model_g.enc_q.enc.cond_layer.bias", "model_g.enc_q.enc.cond_layer.weight_g", "model_g.enc_q.enc.cond_layer.weight_v", "model_g.flow.flows.0.enc.cond_layer.bias", "model_g.flow.flows.0.enc.cond_layer.weight_g", "model_g.flow.flows.0.enc.cond_layer.weight_v", "model_g.flow.flows.2.enc.cond_layer.bias", "model_g.flow.flows.2.enc.cond_layer.weight_g", "model_g.flow.flows.2.enc.cond_layer.weight_v", "model_g.flow.flows.4.enc.cond_layer.bias", "model_g.flow.flows.4.enc.cond_layer.weight_g", "model_g.flow.flows.4.enc.cond_layer.weight_v", "model_g.flow.flows.6.enc.cond_layer.bias", "model_g.flow.flows.6.enc.cond_layer.weight_g", "model_g.flow.flows.6.enc.cond_layer.weight_v", "model_g.dp.cond.weight", "model_g.dp.cond.bias", "model_g.emb_g.weight".
        size mismatch for model_g.enc_p.emb.weight: copying a param with shape torch.Size([130, 96]) from checkpoint, the shape in current model is torch.Size([256, 96]).

Has the model changed since the last checkpoint release? Looking forward to hearing back. Thank you!

synesthesiam commented 1 year ago

You're welcome! Yes, the model has changed slightly in v1.0.0. I increased the number of phonemes (or symbols) to 256 in order to accommodate more languages. I'm only using 150 or so right now, so there's plenty of room for the future without running into the same problem.

Checkpoints for the new model size are available here: https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main I'd recommend the lessac one as a starting point. It's what I've used to train every other model there. A high quality lessac model is training right now.

meriamOu commented 1 year ago

Thank you so much for your immediate reply. I have downloaded the checkpoint lessac/low/epoch=1811-step=438504.ckpt and the piper-1.0.0 release, but I still get a model mismatch: the number of phonemes in the checkpoint is still 130.

python3 -m piper_train \
    --dataset-dir /home/fr \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 64 \
    --validation-split 0.01 \
    --num-test-examples 5 \
    --max_epochs 10000 \
    --precision 32 \
    --resume_from_checkpoint /lessac/low/epoch=1811-step=438504.ckpt \
    --quality x-low

ERROR:

RuntimeError: Error(s) in loading state_dict for VitsModel:
        Missing key(s) in state_dict: "model_g.dec.cond.weight", "model_g.dec.cond.bias", "model_g.enc_q.enc.cond_layer.bias", "model_g.enc_q.enc.cond_layer.weight_g", "model_g.enc_q.enc.cond_layer.weight_v", "model_g.flow.flows.0.enc.cond_layer.bias", "model_g.flow.flows.0.enc.cond_layer.weight_g", "model_g.flow.flows.0.enc.cond_layer.weight_v", "model_g.flow.flows.2.enc.cond_layer.bias", "model_g.flow.flows.2.enc.cond_layer.weight_g", "model_g.flow.flows.2.enc.cond_layer.weight_v", "model_g.flow.flows.4.enc.cond_layer.bias", "model_g.flow.flows.4.enc.cond_layer.weight_g", "model_g.flow.flows.4.enc.cond_layer.weight_v", "model_g.flow.flows.6.enc.cond_layer.bias", "model_g.flow.flows.6.enc.cond_layer.weight_g", "model_g.flow.flows.6.enc.cond_layer.weight_v", "model_g.dp.cond.weight", "model_g.dp.cond.bias", "model_g.emb_g.weight".
        size mismatch for model_g.enc_p.emb.weight: copying a param with shape torch.Size([130, 192]) from checkpoint, the shape in current model is torch.Size([256, 96]).
        size mismatch for model_g.enc_p.encoder.attn_layers.0.emb_rel_k: copying a param with shape torch.Size([1, 9, 96]) from checkpoint, the shape in current model is torch.Size([1, 9, 48]).
        size mismatch for model_g.enc_p.encoder.attn_layers.0.emb_rel_v: copying a param with shape torch.Size([1, 9, 96]) from checkpoint, the shape in current model is torch.Size([1, 9, 48]).
        size mismatch for model_g.enc_p.encoder.attn_layers.0.conv_q.weight: copying a param with shape torch.Size([192, 192, 1]) from checkpoint, the shape in current model is torch.Size([96, 96, 1]).
        ...

synesthesiam commented 1 year ago

You may need to re-run the preprocessing script with version 1.0 to ensure that the correct parameters are written to config.json.
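
For reference, re-running preprocessing under v1.0 looks roughly like this (the language, sample rate, and paths below are placeholders for your own dataset):

python3 -m piper_train.preprocess \
    --language fr \
    --input-dir /path/to/raw-dataset \
    --output-dir /path/to/training-dir \
    --dataset-format ljspeech \
    --single-speaker \
    --sample-rate 22050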

meriamOu commented 1 year ago

Thank you so much for replying. I have re-run the preprocessing with 1.0 and it is the same error: size mismatch for model_g.enc_p.emb.weight: copying a param with shape torch.Size([130, 96]) from checkpoint, the shape in current model is torch.Size([256, 96])... I think the error occurs because the checkpoint was trained with a model that has 130 symbols. The config.json from version 1.0 has "num_symbols": 256, so the model expects [256, 96], which matches 256 symbols. However, the checkpoint has dimension [130, 96] and its config file has "num_symbols": 130. I also noticed that the lessac medium checkpoint has 256 symbols; it is only the lessac low checkpoint that has 130.
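
A quick way to see what a checkpoint actually contains is to print the shape of its text-encoder embedding (first dimension = number of symbols, second = the model's hidden size), e.g. for the file I downloaded:

python3 -c "import torch; print(torch.load('lessac/low/epoch=1811-step=438504.ckpt', map_location='cpu')['state_dict']['model_g.enc_p.emb.weight'].shape)"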

sweetbbak commented 1 year ago

Did you happen to resolve this issue? I'm having the same problem, except I used the current lessac ckpt model that came out today

synesthesiam commented 1 year ago

Yes, the "low quality" lessac checkpoint was accidentally trained with the wrong parameters. I'm retraining it now 🙂

I'd suggest using the lessac medium checkpoint (22050 Hz sample rate).
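
If in doubt, comparing the checkpoint's config.json with the one in your training directory before starting is a quick sanity check, for example with jq (paths are placeholders):

jq '{num_symbols, num_speakers, sample_rate: .audio.sample_rate}' \
    /path/to/checkpoint/config.json /path/to/training-dir/config.json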

sweetbbak commented 1 year ago

Ah, I see. Thank you!

dic1911 commented 1 year ago

@synesthesiam I tried the same checkpoint, but I still got the error. Is there anything I could do to get around it? Thanks!

$ python3 -m piper_train \
    --dataset-dir ./train_0731/ \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 16 \
    --validation-split 0.0 \
    --num-test-examples 0 \
    --resume_from_checkpoint epoch=2164-step=1355540.ckpt \
    --checkpoint-epochs 1 \
    --max_epochs 10000 \
    --precision 32

...

Missing key(s) in state_dict: "model_g.dec.cond.weight", "model_g.dec.cond.bias", "model_g.enc_q.enc.cond_layer.bias", "model_g.enc_q.enc.cond_layer.weight_g", "model_g.enc_q.enc.cond_layer.weight_v", "model_g.flow.flows.0.enc.cond_layer.bias", "model_g.flow.flows.0.enc.cond_layer.weight_g", "model_g.flow.flows.0.enc.cond_layer.weight_v", "model_g.flow.flows.2.enc.cond_layer.bias", "model_g.flow.flows.2.enc.cond_layer.weight_g", "model_g.flow.flows.2.enc.cond_layer.weight_v", "model_g.flow.flows.4.enc.cond_layer.bias", "model_g.flow.flows.4.enc.cond_layer.weight_g", "model_g.flow.flows.4.enc.cond_layer.weight_v", "model_g.flow.flows.6.enc.cond_layer.bias", "model_g.flow.flows.6.enc.cond_layer.weight_g", "model_g.flow.flows.6.enc.cond_layer.weight_v", "model_g.dp.cond.weight", "model_g.dp.cond.bias", "model_g.emb_g.weight".

synesthesiam commented 1 year ago

How many speakers are in your new dataset (specifically num_speakers in config.json)?

dic1911 commented 1 year ago

@synesthesiam I have 24 speakers for my dataset

mirfan899 commented 1 year ago

Any update on this? I'm having the same issue.

hopkira commented 12 months ago

Hi, thanks for all the fantastic work on Piper - I'm hoping it will be just the thing to provide a local custom voice for my Pi 4 robot dog!

I think I may be seeing the same issue as those above? I only have the one speaker and have used the latest low quality 'lessac' checkpoint from Hugging Face. It would seem I have engineered a mismatch between the ckpt file's model dimensions and the 'x-low' quality setting:

RuntimeError: Error(s) in loading state_dict for VitsModel. The dimensional mismatches are:

[256, 192] vs [256, 96]
[1, 9, 96] vs [1, 9, 48]
[192, 192, 1] vs [96, 96, 1]
[192] vs [96]

So a repeating pattern of 192 -> 96 -> 48...

python3 -m piper_train \
    --dataset-dir /home/output/ \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 32 \
    --validation-split 0.0 \
    --num-test-examples 0 \
    --max_epochs 3000 \
    --resume_from_checkpoint '/home/checkpoint/epoch=2307-step=558536.ckpt' \
    --checkpoint-epochs 1 \
    --quality 'x-low' \
    --precision 32

The num_symbols (256), num_speakers (1), speaker_id_map ({}) and piper_version (1.0.0) all seem to align between the checkpoint and my training set.

So, apologies if I've made a stupid mistake! I'll try with the medium models, but a fast, relatively low-quality voice would be ideal for my robot, as he needs to be slightly robotic and fast-responding.

Thanks for your time

krones9000 commented 9 months ago

Apologies for maybe confusing things, but I raised an issue over on the Hugging Face page that I think may be related: https://huggingface.co/datasets/rhasspy/piper-checkpoints/discussions/8

In short, I'm trying to fine-tune from the libritts_r ckpt, as it's quite a good model.

I've been encountering similar issues to the above. As far as I can tell, the only mismatch between my data and the existing model is the number of speakers (904 vs 1). But I'm not sure if this is related to the failure to use the model checkpoint.

If you can offer any guidance as to how to build on top of the libritts_r model that would be massively appreciated.

krones9000 commented 9 months ago

From my HuggingFace comment:

It's the speaker count.

From the config.json of the model: "num_speakers": 904,

I went through my training data and duplicated it until there were 904 instances, then set up my metadata.csv as though there were 904 individual speakers for each line. Now I can fine-tune using the libritts_r model checkpoint. Epochs are incredibly slow compared to fine-tuning on other models, but at least I've confirmed how/why.

EDIT: You don't even need to duplicate your training data files. You can just list them more than once in your csv as if they were new inputs.
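
A sketch of that trick, assuming a single-speaker metadata.csv with id|text lines and the multi-speaker format being id|speaker|text (it also assumes the text itself contains no | characters):

for i in $(seq 0 903); do
    awk -F'|' -v spk="spk$i" '{ print $1 "|" spk "|" $2 }' metadata.csv
done > metadata_904speakers.csv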

raphaelmerx commented 4 months ago

I've been encountering similar issues to the above. As far as I can tell, the only mismatch between my data and the existing model is the number of speakers (904 vs 1). But I'm not sure if this is related to the failure to use the model checkpoint.

For training a multi-speaker model from a single-speaker model, you'll need to use the option --resume_from_single_speaker_checkpoint, see https://github.com/rhasspy/piper/blob/master/TRAINING.md#multi-speaker-fine-tuning
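
For example (the dataset directory and checkpoint path are placeholders):

python3 -m piper_train \
    --dataset-dir /path/to/multi-speaker-dataset \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 32 \
    --validation-split 0.0 \
    --num-test-examples 0 \
    --max_epochs 10000 \
    --checkpoint-epochs 1 \
    --precision 32 \
    --resume_from_single_speaker_checkpoint /path/to/single-speaker.ckpt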

symdec commented 3 months ago

Hello, I have similar issues to the original message, using a checkpoint from the French voices (namely this one: https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main/fr/fr_FR/upmc/medium). Here is the error:

RuntimeError: Error(s) in loading state_dict for VitsModel:
        Unexpected key(s) in state_dict: "model_g.emb_g.weight", "model_g.dec.cond.weight", "model_g.dec.cond.bias", 
"model_g.enc_q.enc.cond_layer.bias", "model_g.enc_q.enc.cond_layer.weight_g", "model_g.enc_q.enc.cond_layer.weight_v", 
"model_g.flow.flows.0.enc.cond_layer.bias", "model_g.flow.flows.0.enc.cond_layer.weight_g", 
"model_g.flow.flows.0.enc.cond_layer.weight_v", "model_g.flow.flows.2.enc.cond_layer.bias", 
"model_g.flow.flows.2.enc.cond_layer.weight_g", "model_g.flow.flows.2.enc.cond_layer.weight_v", 
"model_g.flow.flows.4.enc.cond_layer.bias", "model_g.flow.flows.4.enc.cond_layer.weight_g", 
"model_g.flow.flows.4.enc.cond_layer.weight_v", "model_g.flow.flows.6.enc.cond_layer.bias", 
"model_g.flow.flows.6.enc.cond_layer.weight_g", "model_g.flow.flows.6.enc.cond_layer.weight_v", "model_g.dp.cond.weight", 
"model_g.dp.cond.bias". 

So, after reading the conversation above, I tried to start from en_US lessac-medium and I was able to start the fine-tuning process without seeing the runtime error. But I have a question.

Is it good practice to start from an English checkpoint to get a French model at the end? Otherwise, how can I start from a French checkpoint while avoiding the runtime error?

Thanks in advance :)

TheStigh commented 1 week ago

Is it good practice to start from an English checkpoint to get a French model at the end? Otherwise, how can I start from a French checkpoint while avoiding the runtime error?

Thanks in advance :)

Hi @symdec, did you ever figure out the answer to your questions? I have the same ones for Norwegian :)

symdec commented 1 week ago

Hello @TheStigh, unfortunately not :/ I only managed to fine-tune from the English model I mentioned :)

TheStigh commented 1 week ago

Hello @TheStigh, unfortunately not :/ I only managed to fine-tune from the English model I mentioned :)

How did the outcome perform after fine-tuning with the English model? Are you satisfied?

symdec commented 1 week ago

How did the outcome perform after fine-tuning with the English model? Are you satisfied?

I just performed a small experiment with only a few audio files (~10 files of 10 s each) as the dataset and 2-3 hours of fine-tuning on one T4 GPU.

The result was not great, but promising given the tiny dataset and the short fine-tuning duration. I didn't scale up this experiment, so I can't tell you more about the performance of this process.

N.B.: it is easy to export the fine-tuned model after some epochs to test it and see how it behaves, then let it train longer to improve the result, and so on iteratively.
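
For reference, exporting a checkpoint to try it out looks roughly like this (paths are placeholders; the config.json is the one from your training directory):

python3 -m piper_train.export_onnx \
    /path/to/checkpoint.ckpt \
    /path/to/model.onnx
cp /path/to/training-dir/config.json /path/to/model.onnx.json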