openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

Export Acoustic Model Error:"size mismatch for fs2.txt_embed.weight" #185

Closed Alistair-zhong closed 7 months ago

Alistair-zhong commented 7 months ago

Python: 3.8.19
torch: 1.13.1 (now) / 2.2.2+cu118 (before)
DiffSinger: 76afe57 (latest commit)

I get this error when exporting the acoustic model. I trained the model on a cloud server, then transferred the checkpoint files locally to run the export.

Traceback (most recent call last):
  File "scripts/export.py", line 294, in <module>
    main()
  File "c:\Users\asus\anaconda3\envs\sofa_gui\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "c:\Users\asus\anaconda3\envs\sofa_gui\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "c:\Users\asus\anaconda3\envs\sofa_gui\lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\Users\asus\anaconda3\envs\sofa_gui\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\Users\asus\anaconda3\envs\sofa_gui\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "scripts/export.py", line 137, in acoustic
    exporter = DiffSingerAcousticExporter(
  File "D:\niro-workspace\diffsinger\DiffSinger\deployment\exporters\acoustic_exporter.py", line 36, in __init__
    self.model = self.build_model()
  File "D:\niro-workspace\diffsinger\DiffSinger\deployment\exporters\acoustic_exporter.py", line 94, in build_model
    load_ckpt(model, hparams['work_dir'], ckpt_steps=self.ckpt_steps,
  File "D:\niro-workspace\diffsinger\DiffSinger\utils\__init__.py", line 216, in load_ckpt
    cur_model.load_state_dict(state_dict, strict=strict)
  File "c:\Users\asus\anaconda3\envs\sofa_gui\lib\site-packages\torch\nn\modules\module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DiffSingerAcousticONNX:
    size mismatch for fs2.txt_embed.weight: copying a param with shape torch.Size([45, 256]) from checkpoint, the shape in current model is torch.Size([47, 256]).
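
For context, fs2.txt_embed appears to be the phoneme embedding table of the FastSpeech2-style encoder: its first dimension equals the number of entries in the phoneme list built from the dictionary, so a checkpoint trained against a 45-entry phoneme list cannot be strictly loaded into a model rebuilt from a dictionary that yields 47 entries. A minimal sketch (plain PyTorch, not DiffSinger code) that reproduces the same failure mode:

import torch.nn as nn

# Embedding as it was trained: 45 phoneme IDs, 256-dim vectors.
trained = nn.Embedding(45, 256)
state_dict = trained.state_dict()

# Embedding rebuilt for export from a dictionary that now yields 47 phonemes.
rebuilt = nn.Embedding(47, 256)
try:
    rebuilt.load_state_dict(state_dict)  # strict loading is the default
except RuntimeError as e:
    print(e)  # size mismatch for weight: ... [45, 256] vs [47, 256]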

acoustic config.yaml

base_config: configs/acoustic.yaml

raw_data_dir:
  - data/english_dataset/raw
speakers:
  - english_dataset
spk_ids: []
test_prefixes:
  - 04-15-english-song1_seg003
  - 04-15-english-song1_seg010
  - 04-15-english-song1_seg017
  - 0416-english-song2_seg003
  - 0416-english-song2_seg0010
dictionary: dictionaries/tgm_sofa_dict.txt
binary_data_dir: data/english_dataset/binary
binarization_args:
  num_workers: 2
pe: parselmouth
pe_ckpt: null
vocoder: NsfHifiGAN
vocoder_ckpt: checkpoints/nsf_hifigan/model.ckpt

use_spk_id: false
num_spk: 1

# NOTICE: before enabling variance embeddings, please read the docs at
# https://github.com/openvpi/DiffSinger/tree/main/docs/BestPractices.md#choosing-variance-parameters
use_energy_embed: false
use_breathiness_embed: false
use_voicing_embed: false
use_tension_embed: false

use_key_shift_embed: true
use_speed_embed: true

augmentation_args:
  random_pitch_shifting:
    enabled: true
    range: [-5., 5.]
    scale: 0.75
  fixed_pitch_shifting:
    enabled: false
    targets: [-5., 5.]
    scale: 0.5
  random_time_stretching:
    enabled: true
    range: [0.5, 2.]
    scale: 0.75

residual_channels: 512
residual_layers: 20

# shallow diffusion
diffusion_type: reflow
use_shallow_diffusion: true
T_start: 0.4
T_start_infer: 0.4
K_step: 300
K_step_infer: 300
shallow_diffusion_args:
  train_aux_decoder: true
  train_diffusion: true
  val_gt_start: false
  aux_decoder_arch: convnext
  aux_decoder_args:
    num_channels: 512
    num_layers: 6
    kernel_size: 7
    dropout_rate: 0.1
  aux_decoder_grad: 0.1
lambda_aux_mel_loss: 0.2

optimizer_args:
  lr: 0.0006
lr_scheduler_args:
  scheduler_cls: torch.optim.lr_scheduler.StepLR
  step_size: 10000
  gamma: 0.75
max_batch_frames: 50000
max_batch_size: 64
max_updates: 160000

num_valid_plots: 5
val_with_vocoder: true
val_check_interval: 1000
num_ckpt_keep: 5
permanent_ckpt_start: 2000
permanent_ckpt_interval: 20000
pl_trainer_devices: 'auto'
pl_trainer_precision: '16-mixed'
Alistair-zhong commented 7 months ago

Resolved. The cause was that the dictionary on the export machine had a different number of phonemes than the one used for training, so the rebuilt model's phoneme embedding no longer matched the checkpoint.
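
For anyone hitting the same error, a quick sanity check is to compare the phoneme count of the dictionary on the export machine against the embedding table stored in the checkpoint. The sketch below is only an approximation: it assumes the usual tab-separated "grapheme<TAB>phoneme phoneme ..." dictionary format, the checkpoint path is hypothetical, and DiffSinger adds a few special tokens (e.g. AP/SP/padding) on top of the raw dictionary phonemes, so the two numbers should differ only by that small, constant margin.

import torch

# Hypothetical paths -- the dictionary path is taken from the config above,
# the checkpoint path is just an example; substitute your own.
DICT_PATH = "dictionaries/tgm_sofa_dict.txt"
CKPT_PATH = "checkpoints/english_dataset/model_ckpt_steps_160000.ckpt"

# Count the distinct phonemes the dictionary defines.
phonemes = set()
with open(DICT_PATH, encoding="utf-8") as f:
    for line in f:
        parts = line.strip().split("\t")
        if len(parts) >= 2:
            phonemes.update(parts[-1].split())

# Read the size of the phoneme embedding stored in the checkpoint
# (the key prefix may differ between versions, so match on the suffix).
ckpt = torch.load(CKPT_PATH, map_location="cpu")
emb_key = next(k for k in ckpt["state_dict"] if k.endswith("fs2.txt_embed.weight"))
rows = ckpt["state_dict"][emb_key].shape[0]

print(f"phonemes in dictionary: {len(phonemes)} (plus special tokens)")
print(f"embedding rows in checkpoint: {rows}")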