Closed laiyoi closed 1 year ago
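Running variance inference with an acoustic model is bound to fail. The variance model has to be trained separately, and the ONNX side is not yet integrated with OpenUtau either.
So does the acoustic model still have pitch prediction and phoneme prediction?
To train a variance model, do I change base_config in the dataset's config to variance.yaml and then train?
Exporting the acoustic model to ONNX also fails, and there is only one speaker.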
(diff) PS E:\DiffSinger> python .\scripts\export.py acoustic --exp 0627_liuchan_ds1000_23.06.26
| found ckpt by name: 0627_liuchan_ds1000_23.06.26
| Hparams chains: []
| Hparams:
K_step: 1000, accumulate_grad_batches: 1, audio_num_mel_bins: 128, audio_sample_rate: 44100, augmentation_args: {'fixed_pitch_shifting': {'enabled': True, 'scale': 0.75, 'targets': [-5.0, 5.0]}, 'random_pitch_shifting': {'enabled': False, 'range': [-5.0, 5.0], 'scale': 1.0}, 'random_time_stretching': {'domain': 'log', 'enabled': True, 'range': [0.65, 2.0], 'scale': 2.0}},
base_config: [], binarization_args: {'num_workers': 0, 'shuffle': True}, binarizer_cls: preprocessing.acoustic_binarizer.AcousticBinarizer, binary_data_dir: data/liuchan_23.06.26/binary, breathiness_smooth_width: 0.12,
clip_grad_norm: 1, dataloader_prefetch_factor: 2, ddp_backend: nccl, dictionary: dictionaries/opencpop-extension.txt, diff_accelerator: ddim,
diff_decoder_type: wavenet, diff_loss_type: l2, dilation_cycle_length: 4, dropout: 0.1, ds_workers: 4,
enc_ffn_kernel_size: 9, enc_layers: 4, energy_smooth_width: 0.12, exp_name: 0627_liuchan_ds1000_23.06.26, f0_embed_type: continuous,
ffn_act: gelu, ffn_padding: SAME, fft_size: 2048, fmax: 16000, fmin: 40,
hidden_size: 256, hop_size: 512, infer: True, interp_uv: True, log_interval: 100,
lr_scheduler_args: {'gamma': 0.5, 'scheduler_cls': 'torch.optim.lr_scheduler.StepLR', 'step_size': 52500, 'warmup_steps': 2000}, max_batch_frames: 80000, max_batch_size: 14, max_beta: 0.02, max_updates: 420000,
max_val_batch_frames: 60000, max_val_batch_size: 1, mel_vmax: 1.5, mel_vmin: -6.0, num_ckpt_keep: 2,
num_heads: 2, num_pad_tokens: 1, num_sanity_val_steps: 1, num_spk: 3, num_valid_plots: 10,
optimizer_args: {'beta1': 0.9, 'beta2': 0.98, 'lr': 0.00035, 'optimizer_cls': 'torch.optim.AdamW', 'weight_decay': 0}, permanent_ckpt_interval: 42000, permanent_ckpt_start: 120000, pl_trainer_accelerator: auto, pl_trainer_devices: auto,
pl_trainer_num_nodes: 1, pl_trainer_precision: 32-true, pl_trainer_strategy: auto, pndm_speedup: 10, raw_data_dir: ['data/liuchan_23.06.26/raw'],
rel_pos: True, residual_channels: 512, residual_layers: 20, sampler_frame_count_grid: 6, save_codes: ['configs', 'modules', 'training', 'utils'],
schedule_type: linear, seed: 1234, sort_by_len: True, speakers: ['liuchan'], spec_max: [0],
spec_min: [-5], spk_ids: [], task_cls: training.acoustic_task.AcousticTask, test_prefixes: ['p_1_jz yq_(Vocals)_1_cq_185', 'p_1_jz yq_(Vocals)_1_cq_208', 'p_1_jz yq_(Vocals)_1_cq_280', 'p_1_jz yq_(Vocals)_2_cq_211', 'p_1_jz yq_(Vocals)_3_cq_215', 'p_1_jz yq_(Vocals)_4_cq_146', 'p_1_jz yq_(Vocals)_4_cq_270', 'p_1_jz yq_(Vocals)_4_cq_271', 'p_1_jz yq_(Vocals)_6_cq_194', 'sample2_-4key_liuchan_0.5_sovdiff_1'], timesteps: 1000,
train_set_name: train, use_breathiness_embed: False, use_energy_embed: False, use_key_shift_embed: False, use_pos_embed: True,
use_speed_embed: True, use_spk_id: True, val_check_interval: 3000, val_with_vocoder: True, valid_set_name: valid,
vocoder: NsfHifiGAN, vocoder_ckpt: checkpoints/nsf_hifigan/model, win_size: 2048, work_dir: checkpoints\0627_liuchan_ds1000_23.06.26,
| Exporter: <class 'deployment.exporters.acoustic_exporter.DiffSingerAcousticExporter'>
| load phoneme set: ['AP', 'E', 'En', 'SP', 'a', 'ai', 'an', 'ang', 'ao', 'b', 'c', 'ch', 'd', 'e', 'ei', 'en', 'eng', 'er', 'f', 'g', 'h', 'i', 'i0', 'ia', 'ian', 'iang', 'iao', 'ie', 'in', 'ing', 'iong', 'ir', 'iu', 'j', 'k', 'l', 'm', 'n', 'o', 'ong', 'ou', 'p', 'q', 'r', 's', 'sh', 't', 'u', 'ua', 'uai', 'uan', 'uang', 'ui', 'un', 'uo', 'v', 'van', 've', 'vn', 'w', 'x', 'y', 'z', 'zh']
| load 'model' from 'checkpoints\0627_liuchan_ds1000_23.06.26\model_ckpt_steps_204000.ckpt'.
Traceback (most recent call last):
File ".\scripts\export.py", line 200, in <module>
main()
File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 1055, in main
rv = self.invoke(ctx)
File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "E:\anaconda\envs\diff\lib\site-packages\click\core.py", line 760, in invoke
return __callback(*args, **kwargs)
File ".\scripts\export.py", line 117, in acoustic
exporter = DiffSingerAcousticExporter(
File "E:\DiffSinger\deployment\exporters\acoustic_exporter.py", line 59, in __init__
first_spk = next(self.spk_map.keys())
TypeError: 'dict_keys' object is not an iterator
Adding --freeze_spk works around it, but judging from
if len(self.spk_map) == 1:
# If there is only one speaker, freeze him/her.
this error should not occur in the first place, right?
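The TypeError in the traceback above is ordinary Python behaviour rather than anything specific to DiffSinger: `dict.keys()` returns a view, which is iterable but is not an iterator, so `next()` rejects it. Below is a minimal sketch of the failure and one common way around it. The `spk_map` contents are made up for illustration, and this is not necessarily how the repository fixed it.

```python
# Hypothetical single-speaker map; the real spk_map is built by the exporter.
spk_map = {"liuchan": 0}

try:
    first_spk = next(spk_map.keys())  # dict_keys is a view, not an iterator
except TypeError as err:
    print(f"fails as in the traceback: {err}")

# Wrapping the dict (or its keys view) in iter() yields a real iterator.
first_spk = next(iter(spk_map))
print(first_spk)
```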
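Just fixed it. Please update your code.
Then, if transcriptions.csv is annotated with ph_num, note_seq and note_dur, does an acoustic model trained on it have pitch prediction and phoneme prediction? And to train a variance model, do I change base_config in the dataset's config to variance.yaml and then train?
The acoustic model is what produces the sound. Only the variance model can optionally be configured with phoneme duration and phoneme prediction features. To train one, modify the configuration items on top of variance.yaml according to your needs. The meaning of each item is documented in docs/ConfigurationSchemas.md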
Hello @yqzhishen, I ran into the same problem and hope you can correct me. Here are my steps:
1. Extend acoustic datasets into variance datasets using MakeDiffSinger. This is the content of a .ds file:
[
{
"offset": 0.0,
"text": "SP j əː˦˨ n iː˨˦ w k ɛː˦˥ w k u˨ˀ˥ ŋm tʰ eː˦˥ x o˨˨ ŋm l aː˦˨ m ʔ ɛː˨˥ m k w a˨˨ j tɕ əː˨˩˨ l aː˨˩ j SP",
"ph_seq": "SP j əː˦˨ n iː˨˦ w k ɛː˦˥ w k u˨ˀ˥ ŋm tʰ eː˦˥ x o˨˨ ŋm l aː˦˨ m ʔ ɛː˨˥ m k w a˨˨ j tɕ əː˨˩˨ l aː˨˩ j SP",
"ph_dur": "1.083333 0.190000 0.130000 0.070000 0.090000 0.100000 0.130000 0.040000 0.100000 0.120000 0.060000 ...",
"ph_num": "2 2 3 3 3 2 3 3 4 3 2 2 1",
"note_seq": "rest G3-33 rest G3-32 G3-32 D4+2 D4 D4+1 D4+5 D4-12 C4-7 A#3+9 C4-11 F4+5 F4+5 G4-5 C4+9 D4 D4",
"note_dur": "1.044898 0.128435 0.100000 0.200000 0.066304 0.253696 0.260000 0.280000 0.280000 0.133741 0.136259 ...",
"note_slur": "0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0",
"f0_seq": "228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8 228.8...",
"f0_timestep": "0.011609977324263039"
}
]
And the columns in the transcriptions.csv file: name | ph_seq | ph_dur | ph_num | note_seq | note_dur
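Before binarizing, it can help to check that the fields of each segment agree in length. The sketch below is not part of MakeDiffSinger or DiffSinger; `check_ds_segment` is a hypothetical helper. Two of its rules can be counted directly in the sample above (ph_num sums to the number of ph_seq tokens, and note_slur has one entry per note). The other two (one ph_dur per phoneme, one note_dur per note) are assumptions, since those fields are truncated in the sample. The toy segment is invented for illustration.

```python
def check_ds_segment(seg):
    """Return a list of problems found in one .ds segment (empty if none)."""
    ph_seq = seg["ph_seq"].split()
    ph_dur = seg["ph_dur"].split()
    ph_num = [int(n) for n in seg["ph_num"].split()]
    note_seq = seg["note_seq"].split()
    note_dur = seg["note_dur"].split()
    note_slur = seg["note_slur"].split()

    problems = []
    # Assumed rule: one duration per phoneme.
    if len(ph_dur) != len(ph_seq):
        problems.append(f"ph_dur has {len(ph_dur)} values, ph_seq has {len(ph_seq)}")
    # ph_num groups phonemes into words, so the groups must cover every phoneme.
    if sum(ph_num) != len(ph_seq):
        problems.append(f"ph_num sums to {sum(ph_num)}, ph_seq has {len(ph_seq)}")
    # Assumed rule: one duration per note.
    if len(note_dur) != len(note_seq):
        problems.append(f"note_dur has {len(note_dur)} values, note_seq has {len(note_seq)}")
    # One slur flag per note.
    if len(note_slur) != len(note_seq):
        problems.append(f"note_slur has {len(note_slur)} values, note_seq has {len(note_seq)}")
    return problems


# Invented toy segment, not taken from any real dataset.
sample = {
    "ph_seq": "SP n i h ao SP",
    "ph_dur": "0.5 0.1 0.2 0.1 0.3 0.4",
    "ph_num": "2 2 1 1",
    "note_seq": "rest C4 D4 rest",
    "note_dur": "0.6 0.3 0.3 0.4",
    "note_slur": "0 0 0 0",
}
print(check_ds_segment(sample))
```

Running the helper over every entry of the JSON list in a .ds file would flag segments whose annotations drifted apart during editing.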
2. I put the data into the repo with the structure:
.
└── data
    └── phd
        ├── ds
        ├── transcriptions.csv
        └── wavs
3. Edit configurations and train STEP-1.
Change the configs for "variance model" and "STEP 1: train the diffusion decoder", based on Best Practices:
python scripts/binarize.py --config my_config.yaml
python scripts/train.py --config my_config.yaml --exp_name my_experiment --reset
(Actually, I feel very confused because the commands for the Acoustic model and the Variance model are the same!)
4. Edit configurations and train STEP-2.
Change the configs for "variance model" and "STEP-2: train the auxiliary decoder", based on Best Practices:
python scripts/train.py --config my_config.yaml --exp_name my_experiment --reset
5. Finally, I use the checkpoint from STEP-2 to export ONNX models.
6. I have PyTorch 1.13 set up, but the ONNX export fails:
python scripts/export.py variance --exp my_experiment
| Exporter: <class 'deployment.exporters.variance_exporter.DiffSingerVarianceExporter'>
| load phoneme set: ['AP', 'SP', 'aː˦˥', 'aː˦˨', 'aː˨ˀ˥', 'aː˨˥', 'aː˨˦', 'aː˨˨', 'aː˨˨ˀ', 'aː˨˩', 'aː˨˩ˀ', 'a˦˥', 'a˦˨', 'a˨ˀ˥', 'a˨˥', 'a˨˦', 'a˨˨', 'a˨˨ˀ', 'a˨˩', 'a˨˩ˀ', 'c', 'eː˦˥', 'eː˦˨', 'eː˨˥', 'eː˨˦', 'eː˨˨', 'eː˨˨ˀ', 'eː˨˩', 'e˦˥', 'e˨˨', 'f', 'h', 'iə˦˥', 'iə˨ˀ˥', 'iə˨˥', 'iə˨˦', 'iə˨˨', 'iə˨˩', 'iə˨˩ˀ', 'iə˨˩˨', 'iː˦˥', 'iː˦˨', 'iː˨ˀ˥', 'iː˨˥', 'iː˨˦', 'iː˨˨', 'iː˨˨ˀ', 'iː˨˩', 'iː˨˩ˀ', 'iː˨˩˨', 'i˦˥', 'i˦˨', 'i˨˥', 'i˨˦', 'i˨˨', 'i˨˨ˀ', 'i˨˩', 'i˨˩ˀ', 'i˨˩˨', 'j', 'k', 'kp', 'l', 'm', 'n', 'oː˦˥', 'oː˦˨', 'oː˨ˀ˥', 'oː˨˥', 'oː˨˦', 'oː˨˨', 'oː˨˨ˀ', 'oː˨˩', 'oː˨˩ˀ', 'oː˨˩˨', 'o˦˥', 'o˨ˀ˥', 'o˨˥', 'o˨˦', 'o˨˨', 'o˨˩', 'o˨˩ˀ', 'p', 'r', 's', 't', 'tɕ', 'tʰ', 'uə˦˥', 'uə˦˨', 'uə˨ˀ˥', 'uə˨˥', 'uə˨˦', 'uə˨˨', 'uə˨˨ˀ', 'uə˨˩', 'uə˨˩ˀ', 'uə˨˩˨', 'uː˦˥', 'uː˦˨', 'uː˨ˀ˥', 'uː˨˦', 'uː˨˨', 'uː˨˩ˀ', 'uː˨˩˨', 'u˦˥', 'u˦˨', 'u˨ˀ˥', 'u˨˥', 'u˨˨', 'u˨˨ˀ', 'u˨˩', 'v', 'w', 'x', 'ŋ', 'ŋm', 'ɓ', 'ɔː˦˥', 'ɔː˦˨', 'ɔː˨ˀ˥', 'ɔː˨˥', 'ɔː˨˨', 'ɔː˨˨ˀ', 'ɔː˨˩', 'ɔː˨˩ˀ', 'ɔ˦˥', 'ɔ˦˨', 'ɔ˨ˀ˥', 'ɔ˨˥', 'ɔ˨˨', 'ɔ˨˨ˀ', 'ɔ˨˩', 'ɔ˨˩ˀ', 'ɗ', 'əː˦˥', 'əː˦˨', 'əː˨ˀ˥', 'əː˨˥', 'əː˨˨', 'əː˨˨ˀ', 'əː˨˩', 'əː˨˩˨', 'ə˦˥', 'ə˦˨', 'ə˨ˀ˥', 'ə˨˥', 'ə˨˨', 'ə˨˨ˀ', 'ə˨˩', 'ɛː˦˥', 'ɛː˦˨', 'ɛː˨ˀ˥', 'ɛː˨˥', 'ɛː˨˨', 'ɛː˨˨ˀ', 'ɛː˨˩', 'ɡ', 'ɨə˦˥', 'ɨə˦˨', 'ɨə˨ˀ˥', 'ɨə˨˥', 'ɨə˨˦', 'ɨə˨˨', 'ɨə˨˨ˀ', 'ɨə˨˩', 'ɨə˨˩ˀ', 'ɨə˨˩˨', 'ɨː˦˥', 'ɨː˦˨', 'ɨː˨˥', 'ɨː˨˦', 'ɨː˨˨', 'ɨː˨˩', 'ɨː˨˩ˀ', 'ɨ˦˥', 'ɨ˦˨', 'ɨ˨ˀ˥', 'ɨ˨˥', 'ɨ˨˨ˀ', 'ɲ', 'ʔ']
Traceback (most recent call last):
File "/content/drive/MyDrive/DiffSinger/DiffSinger-main/scripts/export.py", line 244, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/content/drive/MyDrive/DiffSinger/DiffSinger-main/scripts/export.py", line 175, in variance
exporter = DiffSingerVarianceExporter(
File "/content/drive/MyDrive/DiffSinger/DiffSinger-main/deployment/exporters/variance_exporter.py", line 36, in __init__
self.model = self.build_model()
File "/content/drive/MyDrive/DiffSinger/DiffSinger-main/deployment/exporters/variance_exporter.py", line 85, in build_model
model = DiffSingerVarianceONNX(
File "/content/drive/MyDrive/DiffSinger/DiffSinger-main/deployment/modules/toplevel.py", line 88, in __init__
super().__init__(vocab_size=vocab_size)
File "/content/drive/MyDrive/DiffSinger/DiffSinger-main/modules/toplevel.py", line 119, in __init__
self.predict_dur = hparams['predict_dur']
KeyError: 'predict_dur'
Thank you very much if you take the time to read and respond!
@blizzard090
In your steps 3 and 4,
Actually, I feel very confused because the commands for Acoustic model and Variance model are the same!
Configuration files have binarizer and trainer classes defined in them, so if you inherit from the correct base config, binarize.py and train.py will recognize the model type correctly. The most important thing is that you yourself should know clearly about what type of model you are training! Also, only acoustic models have shallow diffusion profiles, so the docs you listed are not for variance models.
In your step 6,
Again, please figure out what type of model you are exporting. The error is raised because you tried to export an acoustic model with the variance exporter, and acoustic models do not contain the key predict_dur.
Since this issue has been closed, please raise a new issue if you meet further problems.
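The point about knowing which model type a config describes can be turned into a quick check before running the exporter. The sketch below relies only on what is visible in this thread: the acoustic hparams dump contains `task_cls: training.acoustic_task.AcousticTask`, and the variance exporter expects a `predict_dur` key. `guess_model_type` is a hypothetical helper, not part of DiffSinger, and matching on the substring "variance" in `task_cls` is an assumption about how the variance task class is named.

```python
def guess_model_type(hparams):
    """Guess 'acoustic', 'variance' or 'unknown' from a loaded hparams dict."""
    task_cls = str(hparams.get("task_cls", "")).lower()
    if "acoustic" in task_cls:
        return "acoustic"
    # Assumption: variance configs name a variance task class and/or define
    # predict_dur, the key the variance exporter failed to find above.
    if "variance" in task_cls or "predict_dur" in hparams:
        return "variance"
    return "unknown"


# Value copied from the hparams dump earlier in the thread.
acoustic_hparams = {"task_cls": "training.acoustic_task.AcousticTask"}
print(guess_model_type(acoustic_hparams))  # prints "acoustic"
```

If the result is "acoustic", the matching subcommand is the one used earlier in the thread, `python scripts/export.py acoustic --exp ...`, rather than `variance`.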
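What should I pass to the --predict parameter? It looks like it selects which parameter the variance model predicts, but passing pitch raises an error.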