openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility, based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Apache License 2.0

How to run inference with a variance model? #103

Closed laiyoi closed 1 year ago

laiyoi commented 1 year ago

What should I pass to the --predict argument? It looks like it specifies which parameters the prediction model predicts, but passing pitch raises an error:


| Hparams:
K_step: 1000, accumulate_grad_batches: 1, audio_num_mel_bins: 128, audio_sample_rate: 44100, augmentation_args: {'fixed_pitch_shifting': {'enabled': True, 'scale': 0.75, 'targets': [-5.0, 5.0]}, 'random_pitch_shifting': {'enabled': False, 'range': [-5.0, 5.0], 'scale': 1.0}, 'random_time_stretching': {'domain': 'log', 'enabled': True, 'range': [0.65, 2.0], 'scale': 2.0}},
base_config: [], binarization_args: {'num_workers': 0, 'shuffle': True}, binarizer_cls: preprocessing.acoustic_binarizer.AcousticBinarizer, binary_data_dir: data/liuchan_23.06.26/binary, breathiness_smooth_width: 0.12,
clip_grad_norm: 1, dataloader_prefetch_factor: 2, ddp_backend: nccl, dictionary: dictionaries/opencpop-extension.txt, diff_accelerator: ddim,
diff_decoder_type: wavenet, diff_loss_type: l2, dilation_cycle_length: 4, dropout: 0.1, ds_workers: 4,
enc_ffn_kernel_size: 9, enc_layers: 4, energy_smooth_width: 0.12, exp_name: 0627_liuchan_ds1000_23.06.26, f0_embed_type: continuous,
ffn_act: gelu, ffn_padding: SAME, fft_size: 2048, fmax: 16000, fmin: 40,
hidden_size: 256, hop_size: 512, infer: True, interp_uv: True, log_interval: 100,
lr_scheduler_args: {'gamma': 0.5, 'scheduler_cls': 'torch.optim.lr_scheduler.StepLR', 'step_size': 52500, 'warmup_steps': 2000}, max_batch_frames: 80000, max_batch_size: 14, max_beta: 0.02, max_updates: 420000,
max_val_batch_frames: 60000, max_val_batch_size: 1, mel_vmax: 1.5, mel_vmin: -6.0, num_ckpt_keep: 2,
num_heads: 2, num_pad_tokens: 1, num_sanity_val_steps: 1, num_spk: 3, num_valid_plots: 10,
optimizer_args: {'beta1': 0.9, 'beta2': 0.98, 'lr': 0.00035, 'optimizer_cls': 'torch.optim.AdamW', 'weight_decay': 0}, permanent_ckpt_interval: 42000, permanent_ckpt_start: 120000, pl_trainer_accelerator: auto, pl_trainer_devices: auto,
pl_trainer_num_nodes: 1, pl_trainer_precision: 32-true, pl_trainer_strategy: auto, pndm_speedup: 10, raw_data_dir: ['data/liuchan_23.06.26/raw'],
rel_pos: True, residual_channels: 512, residual_layers: 20, sampler_frame_count_grid: 6, save_codes: ['configs', 'modules', 'training', 'utils'],
schedule_type: linear, seed: 1234, sort_by_len: True, speakers: ['liuchan'], spec_max: [0],
spec_min: [-5], spk_ids: [], task_cls: training.acoustic_task.AcousticTask, test_prefixes: ['p_1_jz yq_(Vocals)_1_cq_185', 'p_1_jz yq_(Vocals)_1_cq_208', 'p_1_jz yq_(Vocals)_1_cq_280', 'p_1_jz yq_(Vocals)_2_cq_211', 'p_1_jz yq_(Vocals)_3_cq_215', 'p_1_jz yq_(Vocals)_4_cq_146', 'p_1_jz yq_(Vocals)_4_cq_270', 'p_1_jz yq_(Vocals)_4_cq_271', 'p_1_jz yq_(Vocals)_6_cq_194', 'sample2_-4key_liuchan_0.5_sovdiff_1'], timesteps: 1000,
train_set_name: train, use_breathiness_embed: False, use_energy_embed: False, use_key_shift_embed: False, use_pos_embed: True,
use_speed_embed: True, use_spk_id: True, val_check_interval: 3000, val_with_vocoder: True, valid_set_name: valid,
vocoder: NsfHifiGAN, vocoder_ckpt: checkpoints/nsf_hifigan/model, win_size: 2048, work_dir: checkpoints\0627_liuchan_ds1000_23.06.26,
| load phoneme set: ['AP', 'E', 'En', 'SP', 'a', 'ai', 'an', 'ang', 'ao', 'b', 'c', 'ch', 'd', 'e', 'ei', 'en', 'eng', 'er', 'f', 'g', 'h', 'i', 'i0', 'ia', 'ian', 'iang', 'iao', 'ie', 'in', 'ing', 'iong', 'ir', 'iu', 'j', 'k', 'l', 'm', 'n', 'o', 'ong', 'ou', 'p', 'q', 'r', 's', 'sh', 't', 'u', 'ua', 'uai', 'uan', 'uang', 'ui', 'un', 'uo', 'v', 'van', 've', 'vn', 'w', 'x', 'y', 'z', 'zh']
Traceback (most recent call last):
  File ".\scripts\infer.py", line 218, in <module>
    main()
  File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File ".\scripts\infer.py", line 208, in variance
    infer_ins = DiffSingerVarianceInfer(ckpt_steps=ckpt, predictions=set(predict))
  File "E:\DiffSinger\inference\ds_variance.py", line 42, in __init__
    self.model: DiffSingerVariance = self.build_model(ckpt_steps=ckpt_steps)
  File "E:\DiffSinger\inference\ds_variance.py", line 67, in build_model
    model = DiffSingerVariance(
  File "E:\DiffSinger\modules\toplevel.py", line 71, in __init__
    self.predict_dur = hparams['predict_dur']
KeyError: 'predict_dur'
Passing dur doesn't work either.
Also, once the model is exported to ONNX, does OpenUtau support this kind of prediction?
yqzhishen commented 1 year ago

It would be a wonder if running variance inference with an acoustic model did NOT throw an error. Variance models have to be trained separately, and for now the ONNX side has not been integrated with OpenUtau either.
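
For reference, a trained variance model would be invoked through the variance subcommand of scripts/infer.py seen in the traceback above; a hypothetical example, where the input .ds file and experiment name are placeholders (--predict is repeated once per predicted parameter):

(diff) PS E:\DiffSinger> python .\scripts\infer.py variance my_song.ds --exp my_variance_exp --predict dur --predict pitch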

laiyoi commented 1 year ago

Then does the acoustic model still have pitch prediction and phoneme prediction capabilities?

laiyoi commented 1 year ago

To train a variance model, do I just change base_config in the dataset's config to variance.yaml and then train?

laiyoi commented 1 year ago

ONNX export of the acoustic model also fails, even though there is only one speaker:

(diff) PS E:\DiffSinger> python .\scripts\export.py acoustic --exp 0627_liuchan_ds1000_23.06.26
| found ckpt by name: 0627_liuchan_ds1000_23.06.26
| Hparams chains:  []
| Hparams:
... (same Hparams as in the inference log above)
| Exporter: <class 'deployment.exporters.acoustic_exporter.DiffSingerAcousticExporter'>
| load phoneme set: ['AP', 'E', 'En', 'SP', 'a', 'ai', 'an', 'ang', 'ao', 'b', 'c', 'ch', 'd', 'e', 'ei', 'en', 'eng', 'er', 'f', 'g', 'h', 'i', 'i0', 'ia', 'ian', 'iang', 'iao', 'ie', 'in', 'ing', 'iong', 'ir', 'iu', 'j', 'k', 'l', 'm', 'n', 'o', 'ong', 'ou', 'p', 'q', 'r', 's', 'sh', 't', 'u', 'ua', 'uai', 'uan', 'uang', 'ui', 'un', 'uo', 'v', 'van', 've', 'vn', 'w', 'x', 'y', 'z', 'zh']
| load 'model' from 'checkpoints\0627_liuchan_ds1000_23.06.26\model_ckpt_steps_204000.ckpt'.
Traceback (most recent call last):
  File ".\scripts\export.py", line 200, in <module>
    main()
  File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File ".\scripts\export.py", line 117, in acoustic
    exporter = DiffSingerAcousticExporter(
  File "E:\DiffSinger\deployment\exporters\acoustic_exporter.py", line 59, in __init__
    first_spk = next(self.spk_map.keys())
TypeError: 'dict_keys' object is not an iterator

Although adding --freeze_spk works around it, judging from

if len(self.spk_map) == 1:
    # If there is only one speaker, freeze him/her.

this shouldn't be happening in the first place, should it?
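
For context, a minimal sketch of the failure: in Python 3, dict.keys() returns a view object rather than an iterator, so calling next() on it directly raises exactly this TypeError.

    # spk_map stands in for self.spk_map in acoustic_exporter.py
    spk_map = {'liuchan': 0}

    # next(spk_map.keys())                  # TypeError: 'dict_keys' object is not an iterator
    first_spk = next(iter(spk_map.keys()))  # wrapping the view in iter() works
    print(first_spk)                        # prints: liuchan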

yqzhishen commented 1 year ago

Just fixed it; please update your code.

laiyoi commented 1 year ago

Then, if ph_num, note_seq and note_dur are annotated in transcriptions.csv, does the trained acoustic model have pitch prediction and phoneme prediction capabilities? And to train a variance model, do I just change base_config in the dataset's config to variance.yaml and then train?

yqzhishen commented 1 year ago

The acoustic model is what produces the sound; only the variance model can be optionally equipped with phoneme duration prediction and pitch prediction. To train one, modify the configuration options on top of variance.yaml according to your needs; the meaning of each option is documented in docs/ConfigurationSchemas.md.
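
A minimal sketch of such a dataset config, reusing the paths from this thread purely as placeholders (predict_dur is the key seen in the KeyError above; predict_pitch is an assumed companion option, so check docs/ConfigurationSchemas.md for the exact names):

    # dataset config built on top of variance.yaml (illustrative values)
    base_config: configs/variance.yaml
    raw_data_dir: [data/liuchan_23.06.26/raw]
    binary_data_dir: data/liuchan_23.06.26/binary
    speakers: [liuchan]
    # enable the predictors you need
    predict_dur: true
    predict_pitch: true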

blizzard090 commented 3 months ago

Hello @yqzhishen, I had the same problem and hope you can correct me. Here is my step-by-step process:

1. Extend the acoustic dataset into a variance dataset using MakeDiffSinger. This is the content of a .ds file:

[
    {
        "offset": 0.0,
        "text": "SP j əː˦˨ n iː˨˦ w k ɛː˦˥ w k u˨ˀ˥ ŋm tʰ eː˦˥ x o˨˨ ŋm l aː˦˨ m ʔ ɛː˨˥ m k w a˨˨ j tɕ əː˨˩˨ l aː˨˩ j SP",
        "ph_seq": "SP j əː˦˨ n iː˨˦ w k ɛː˦˥ w k u˨ˀ˥ ŋm tʰ eː˦˥ x o˨˨ ŋm l aː˦˨ m ʔ ɛː˨˥ m k w a˨˨ j tɕ əː˨˩˨ l aː˨˩ j SP",
        "ph_dur": "1.083333 0.190000 0.130000 0.070000 0.090000 0.100000 0.130000 0.040000 0.100000 0.120000 0.060000 ...",
        "ph_num": "2 2 3 3 3 2 3 3 4 3 2 2 1",
        "note_seq": "rest G3-33 rest G3-32 G3-32 D4+2 D4 D4+1 D4+5 D4-12 C4-7 A#3+9 C4-11 F4+5 F4+5 G4-5 C4+9 D4 D4",
        "note_dur": "1.044898 0.128435 0.100000 0.200000 0.066304 0.253696 0.260000 0.280000 0.280000 0.133741 0.136259 ...",
        "note_slur": "0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0",
        "f0_seq": "228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8...",
        "f0_timestep": "0.011609977324263039"
    }
]

And the columns in the transcriptions.csv file are: name | ph_seq | ph_dur | ph_num | note_seq | note_dur

2. I put the data into the repo with the structure:

.
└── data
    └── phd
        ├── ds
        ├── transcriptions.csv
        └── wavs

3. Edit the configuration and train STEP 1: I changed the configs following the "variance model" and "STEP 1: train the diffusion decoder" sections of Best Practices.

4. Edit the configuration and train STEP 2: I changed the configs following the "variance model" and "STEP 2: train the auxiliary decoder" sections of Best Practices.

5. Finally, I use the checkpoint from STEP 2 to export the ONNX models.

6. I have PyTorch 1.13 set up, but the ONNX model cannot be exported:

yqzhishen commented 3 months ago

@blizzard090

In your steps 3 and 4:

> Actually, I feel very confused because the commands for Acoustic model and Variance model are the same!

Configuration files have binarizer and trainer classes defined in them, so if you inherit from the correct base config, binarize.py and train.py will recognize the model type correctly. The most important thing is that you yourself should know clearly what type of model you are training!

Also, only acoustic models have shallow diffusion profiles, so the docs you listed are not for variance models.

In your step 6:

Again, please figure out what type of model you are exporting. The error is raised because you tried to export an acoustic model with the variance exporter, and acoustic models do not contain the key predict_dur.
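
In other words, the exporter has to match the model type. A hypothetical pair of commands, assuming a variance subcommand of scripts/export.py analogous to the acoustic one shown in the log above (experiment names are placeholders):

    python scripts/export.py acoustic --exp my_acoustic_exp
    python scripts/export.py variance --exp my_variance_exp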

Since this issue has been closed, please open a new issue if you run into further problems.
