openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Apache License 2.0

Different timbres from the same singer, separated into unique speakers, all sound identical in a multi-speaker model #158

Closed spicytigermeat closed 7 months ago

spicytigermeat commented 7 months ago

Hi, I've been having this issue for quite some time and have tried a ton of different things to resolve it, with no luck. I've been training some English multi-speaker models; there are about 3 characters and 12 speakers total in the model. Each character sounds distinct from the others, but the separate tones of a character end up sounding exactly the same (example: soft/power have the exact same synthesized timbre despite the source recordings sounding distinct from each other). I've tried rewriting the configuration, setting up the repo for training from scratch, and removing about 3 hrs of audio from the dataset, and nothing has changed. I'll include my acoustic and base configurations below. Any help is appreciated :)

Note: I'm using a custom fine-tuned vocoder for validation

main acoustic config (tgm_acoustic_leif.yaml)

```yaml
base_config:
  - configs/base.yaml
task_cls: training.acoustic_task.AcousticTask
num_spk: 13
speakers: # commented numbers are the index of the speaker
  - tiger_fresh #0
  - triton_gale #1
  - canary_core #2
  - leif_blossom_e #3
  - leif_blossom_j #4
  - leif_lush_e #5
  - leif_lush_j #6
  - leif_uprooted_e #7
  - leif_uprooted_j #8
  - leif_petal_e #9
  - leif_petal_j #10
  - tiger_disco #11
  - tiger_electric #12
  #- ritsu #13
raw_data_dir:
  - data/training_data/tiger_fresh #0
  - data/training_data/triton_gale #1
  - data/training_data/canary_core #2
  - data/training_data/leif_blossom_e #4
  - data/training_data/leif_blossom_j #5
  - data/training_data/leif_lush_e #6
  - data/training_data/leif_lush_j #7
  - data/training_data/leif_uprooted_e #8
  - data/training_data/leif_uprooted_j #9
  - data/training_data/leif_petal_e #10
  - data/training_data/leif_petal_j #11
  - data/training_data/tiger_disco #13
  - data/training_data/tiger_electric #14
  #- data/training_data/ritsu #22
spk_ids: []
test_prefixes:
  # tiger_fresh
  - 0:familiar_seg016
  - 0:golden_hour_seg000
  - 0:rougenodengon_seg006
  - 0:sungoesdown_seg007
  - 0:videogames_seg002
  # triton_gale
  - 1:natalie_dont_seg000
  - 1:housewife_seg000
  - 1:blinding_lights_seg002
  - 1:surround_me_seg002
  - 1:your_power_seg000
  # canary_core
  - 2:intergalactia_seg009
  - 2:still_alive_seg014
  - 2:cyber_angel_seg003
  - 2:canary_t2_02_seg003
  - 2:canary_t2_02_seg000
  # leif_blossom_e
  - 3:leif_blossom_06_seg000
  - 3:leif_blossom_17_seg000
  - 3:leif_blossom_11_seg001
  - 3:leif_blossom_24_seg000
  - 3:leif_blossom_36_seg002
  # leif_blossom_j
  - 4:leif_blossom_j_05_seg000
  - 4:leif_blossom_j_12_seg004
  - 4:leif_blossom_j_10_seg000
  - 4:leif_blossom_j_13_seg003
  - 4:leif_blossom_j_15_seg001
  # leif_lush_e
  - 5:leif_lush_04_seg000
  - 5:leif_lush_10_seg000
  - 5:leif_lush_11_seg001
  - 5:leif_lush_24_seg002
  - 5:leif_lush_33_seg000
  # leif_lush_j
  - 6:leif_lush_j_02_seg001
  - 6:leif_lush_j_07_seg000
  - 6:leif_lush_j_06_seg002
  - 6:leif_lush_j_13_seg002
  - 6:leif_lush_j_18_seg001
  # leif_uprooted_e
  - 7:leif_uprooted_01_seg001
  - 7:leif_uprooted_05_seg000
  - 7:leif_uprooted_10_seg000
  - 7:leif_uprooted_17_seg003
  - 7:leif_uprooted_20_seg000
  # leif_uprooted_j
  - 8:leif_uprooted_j_08_seg003
  - 8:leif_uprooted_j_13_seg000
  - 8:leif_uprooted_j_03_seg001
  - 8:leif_uprooted_j_08_seg000
  - 8:leif_uprooted_j_12_seg003
  # leif_petal_e
  - 9:leif_petal_08_seg001
  - 9:leif_petal_11_seg002
  - 9:leif_petal_01_seg001
  - 9:leif_petal_10_seg002
  - 9:leif_petal_20_seg004
  # leif_petal_j
  - 10:leif_petal_j_04_seg001
  - 10:leif_petal_j_07_seg003
  - 10:leif_petal_j_04_seg003
  - 10:leif_petal_j_13_seg003
  - 10:leif_petal_j_15_seg003
  # tiger_disco
  - 11:afternoon_in_heaven_seg008
  - 11:i_still_wanna_know_seg009
  - 11:dreamsweet_seg009
  - 11:fireflies_seg008
  - 11:so_cold_seg003
  # tiger_electric
  - 12:funky_again_seg008
  - 12:independant_together_seg013
  - 12:rightround_seg013
  - 12:you_and_i_9_seg000
  - 12:still_feel_seg011
  # ritsu
  #- 13:sakura_seg011
  #- 13:tsutsuuraura_seg004
  #- 13:traumerei_seg009
  #- 13:WAVE_seg001
  #- 13:Worlds_End_Celebrate_seg003
  # stock_data
  #- 21:Blank_S_1_seg001
  #- 21:Spotless_M_4_seg000
vocoder: NsfHifiGAN
vocoder_ckpt: checkpoints/tgm_hifigan/generator.ckpt
audio_sample_rate: 44100
audio_num_mel_bins: 128
hop_size: 512  # Hop size.
fft_size: 2048  # FFT size.
win_size: 2048  # FFT size.
fmin: 40
fmax: 16000
binarization_args:
  shuffle: true
  num_workers: 0  # default: 0
augmentation_args:
  random_pitch_shifting:
    enabled: false
    range: [-5., 5.]
    scale: 1.0
  fixed_pitch_shifting:
    enabled: false
    targets: [-5., 5.]
    scale: 0.75
  random_time_stretching:
    enabled: false
    range: [0.5, 2.]
    domain: log  # or linear
    scale: 1.0
binary_data_dir: data/binary/tgm_b04/acoustic_binary_4
binarizer_cls: preprocessing.acoustic_binarizer.AcousticBinarizer
dictionary: dictionaries/tgm_dictionary_norx.txt
num_pad_tokens: 1
spec_min: [-5]
spec_max: [0]
mel_vmin: -6.  #-6.
mel_vmax: 1.5
interp_uv: true
energy_smooth_width: 0.12
breathiness_smooth_width: 0.12
use_spk_id: true
f0_embed_type: discrete
use_energy_embed: false
use_breathiness_embed: false
use_key_shift_embed: false
use_speed_embed: false
timesteps: 1000
max_beta: 0.02
rel_pos: true
diff_accelerator: ddim
pndm_speedup: 10
hidden_size: 256
residual_layers: 20
residual_channels: 512
dilation_cycle_length: 4  # *
diff_decoder_type: 'wavenet'
diff_loss_type: l1
schedule_type: 'linear'

# shallow diffusion
use_shallow_diffusion: true
K_step: 400
K_step_infer: 400
shallow_diffusion_args:
  train_aux_decoder: true
  train_diffusion: true
  val_gt_start: true
  aux_decoder_arch: convnext
  aux_decoder_args:
    num_channels: 512
    num_layers: 6
    kernel_size: 7
    dropout_rate: 0.1
  aux_decoder_grad: 0.1
lambda_aux_mel_loss: 0.2

# train and eval
num_sanity_val_steps: 1
optimizer_args:
  optimizer_cls: torch.optim.AdamW
  lr: 0.0004
  beta1: 0.9
  beta2: 0.98
  weight_decay: 0
lr_scheduler_args:
  scheduler_cls: torch.optim.lr_scheduler.StepLR
  warmup_steps: 10000
  step_size: 15000
  gamma: 0.5
max_batch_frames: 80000
max_batch_size: 16
dataset_size_key: 'lengths'
val_with_vocoder: true
val_check_interval: 2000
num_valid_plots: 10
max_updates: 320000
num_ckpt_keep: 20
permanent_ckpt_start: 200000
permanent_ckpt_interval: 40000
finetune_enabled: true
finetune_ckpt_path: checkpoints/tgm_acou_b04-2/ARCHIVE_CKPT/model_ckpt_steps_47500.ckpt
finetune_ignored_params:
  - model.fs2.encoder.embed_tokens
  - model.fs2.txt_embed
  - model.fs2.spk_embed
finetune_strict_shapes: true
freezing_enabled: false
frozen_params: []
use_melody_encoder: true
use_glide_embed: false
```
base config (base.yaml)

```yaml
# task
task_cls: ''
seed: 1234
save_codes:
  - configs
  - modules
  - training
  - utils

#############
# dataset
#############
sort_by_len: true
raw_data_dir: ''
binary_data_dir: ''
binarizer_cls: ''
binarization_args:
  shuffle: false
  num_workers: 0
audio_num_mel_bins: 128
audio_sample_rate: 44100
hop_size: 512  # For 22050Hz, 275 ~= 12.5 ms (0.0125 * sample_rate)
win_size: 2048  # For 22050Hz, 1100 ~= 50 ms (If None, win_size: fft_size) (0.05 * sample_rate)
fmin: 40  # Set this to 55 if your speaker is male! if female, 95 should help taking off noise. (To test depending on dataset. Pitch info: male~[65, 260], female~[100, 525])
fmax: 16000  # To be increased/reduced depending on data.
fft_size: 2048  # Extra window size is filled with 0 paddings to match this parameter
mel_vmin: -6
mel_vmax: 1.5
sampler_frame_count_grid: 6
ds_workers: 4
dataloader_prefetch_factor: 2

#########
# model
#########
hidden_size: 256
dropout: 0.1
use_pos_embed: true
enc_layers: 4
num_heads: 2
enc_ffn_kernel_size: 9
ffn_act: gelu
ffn_padding: 'SAME'
use_spk_id: false

###########
# optimization
###########
optimizer_args:
  optimizer_cls: torch.optim.AdamW
  lr: 0.0004
  beta1: 0.9
  beta2: 0.98
  weight_decay: 0
lr_scheduler_args:
  scheduler_cls: torch.optim.lr_scheduler.StepLR
  step_size: 50000
  gamma: 0.5
clip_grad_norm: 1

###########
# train and eval
###########
num_ckpt_keep: 5
accumulate_grad_batches: 1
log_interval: 100
num_sanity_val_steps: 1  # steps of validation at the beginning
val_check_interval: 2000
max_updates: 120000
max_batch_frames: 32000
max_batch_size: 100000
max_val_batch_frames: 60000
max_val_batch_size: 1
train_set_name: 'train'
valid_set_name: 'valid'
pe: 'rmvpe'
pe_ckpt: checkpoints/rmvpe/model.pt
vocoder: ''
vocoder_ckpt: ''
num_valid_plots: 10

###########
# pytorch lightning
# Read https://lightning.ai/docs/pytorch/stable/common/trainer.html#trainer-class-api for possible values
###########
pl_trainer_accelerator: 'auto'
pl_trainer_devices: 'auto'
pl_trainer_precision: '16-mixed'
pl_trainer_num_nodes: 1
pl_trainer_strategy:
  name: auto
  process_group_backend: nccl
  find_unused_parameters: false
nccl_p2p: true

###########
# finetune
###########
finetune_enabled: false
finetune_ckpt_path: null
finetune_ignored_params: []
finetune_strict_shapes: true
freezing_enabled: false
frozen_params: []
```
yqzhishen commented 7 months ago

The vocoder has nothing to do with the timbre. Do different timbres from the same singer sound the same also on TensorBoard? How different do the timbres sound from each other?

Maybe irrelevant, but your configuration has many improper values. Please copy the template configuration and edit it; do not edit any pre-existing files, and do not derive from base.yaml directly, as introduced in the documentation. Do not use fine-tuning except in extremely special cases. Enable augmentation. Enable AMP. Use a larger batch size.
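
(For reference, a rough sketch of what those suggestions could look like against the acoustic config posted above; the keys are taken from the configs in this issue, but the values here are only illustrative, not recommendations from this thread.)

```yaml
# Illustrative only -- copy the official template config and adjust there.
augmentation_args:
  random_pitch_shifting:
    enabled: true                  # enable augmentation
    range: [-5., 5.]
    scale: 1.0
pl_trainer_precision: '16-mixed'   # AMP (already the value shipped in base.yaml)
max_batch_frames: 80000
max_batch_size: 48                 # "larger batch size" -- placeholder value, tune to your GPU memory
```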

spicytigermeat commented 7 months ago

The vocoder has nothing to do with the timbre. Do different timbres from the same singer sound the same also on TensorBoard? How different do the timbres sound from each other?

I only mentioned the vocoder to cover all of the differences. No, the different timbres sound very similar to the ground truth samples in TensorBoard (so long as training is far enough along).

Maybe irrelevant, but your configuration has many improper values. Please copy the template configuration and edit it; do not edit any pre-existing files, and do not derive from base.yaml directly, as introduced in the documentation. Do not use fine-tuning except in extremely special cases. Enable augmentation. Enable AMP. Use a larger batch size.

Okay, I'll readjust the configuration using your recommendations and see if I get better results! In what cases would you recommend using fine-tuning?

yqzhishen commented 7 months ago

No, the different timbres sound very similar to the ground truth samples in TensorBoard (so long as training is far enough along).

So you mean the timbres are distinct from each other on TensorBoard but very similar in OpenUTAU? The only possibility I can imagine is that you forgot to split the timbres when training the variance model (energy & breathiness) the way you did for your acoustic model; in that case the timbres can get mixed up. But in your configuration I saw that you did not enable those two parameters.
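
(Purely as an illustration of what "splitting the timbres" means here: if a variance model were trained alongside this acoustic model, its config would declare the same per-timbre speakers rather than merging them, along the lines of the sketch below. The speaker names are taken from the acoustic config posted in this issue, and the key names mirror it; treat this as an assumption, not a verified variance template.)

```yaml
# Hypothetical variance config fragment -- same speaker split as the acoustic config above.
use_spk_id: true
num_spk: 13
speakers:
  - tiger_fresh   #0
  - triton_gale   #1
  - canary_core   #2
  # ... remaining entries in the same order as the acoustic config ...
```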

In what cases would you recommend using fine-tuning?

Currently the only recommended use case is training the aux decoder and the diffusion decoder separately when shallow diffusion is enabled. Fine-tuning is not that helpful in regular cases. If you fine-tune a model, it will not save many training steps if you want to completely wash out the timbres in the pre-trained model, and if you train for enough steps, it will cause catastrophic forgetting. If you discard some layers or embeddings before fine-tuning, it may perform even worse than starting from scratch. Meanwhile, fine-tuning requires careful adjustment of the training-related hyperparameters to get the best results. In short, do not use fine-tuning unless guided by the documentation, or unless you are an expert and clearly aware of what you are doing. Especially for people who own enough high-quality, well-labeled data: please train from scratch.
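
(As an illustration of that one recommended use case: with the shallow_diffusion_args keys from the acoustic config above, training the two decoders separately might be staged roughly as sketched below. The exact recommended procedure and values live in the official documentation, so this is a sketch, not the canonical recipe.)

```yaml
# Stage 1 (hypothetical): train only the aux decoder.
use_shallow_diffusion: true
shallow_diffusion_args:
  train_aux_decoder: true
  train_diffusion: false

# Stage 2 (hypothetical): resume from the stage-1 checkpoint via fine-tuning
# and train only the diffusion decoder.
#   shallow_diffusion_args:
#     train_aux_decoder: false
#     train_diffusion: true
#   finetune_enabled: true
#   finetune_ckpt_path: checkpoints/<stage1_experiment>/xxx.ckpt  # placeholder path
```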

spicytigermeat commented 7 months ago

So you mean the timbres are distinct from each other on TensorBoard but very similar in OpenUTAU? The only possibility I can imagine is that you forgot to split the timbres when training the variance model (energy & breathiness) the way you did for your acoustic model; in that case the timbres can get mixed up. But in your configuration I saw that you did not enable those two parameters.

Yes, they sound distinct in TensorBoard but almost identical in OpenUTAU. There are slight differences in the waveforms, but generally all of the unique timbre gets removed and they all sound like they were trained together rather than as separate speakers. I generally wasn't happy with the results I got with energy and breathiness before, so I decided not to train with those parameters. Do you think that might have something to do with this issue?

Thank you for better explaining the use of fine-tuning! I'll be sure to stick to training from scratch going forward.

The only other thing I can think of that might be causing this issue is the amount of data I'm using and the number of speakers. I never had this issue when I was training on smaller amounts of data (~2 hrs, 2 different vocalists, 6 different "voice modes"/speakers in DiffSinger), and now my dataset is ~6 hrs, 6 different vocalists and up to 23 different speakers in DiffSinger. That's about all my GPU can handle (I train locally).

I ran another test last night training on only about 2 hrs of data across 3 vocalists and 10 speakers in the config, and still got the same issue after 200 epochs/12k steps of acoustic training. I can confirm the data is high quality and tagged well (all done by hand by me). Thanks so much for all of your help!

yqzhishen commented 7 months ago

I mean, if the timbres are distinct from each other on TensorBoard, they are expected to be distinct from each other in OpenUTAU as well, because the conditions are the same. If you do believe the TensorBoard samples really sound as you expect, then something must be going wrong elsewhere; otherwise, the problem would already have shown up on TensorBoard.

A possible way to debug this is to export the DS files from OpenUTAU and use `python scripts/infer.py acoustic your_project.ds --spk your_spk` to verify whether the model is really trained correctly.
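
(For example, one might render the same DS file twice with two speaker names from the config and compare the results by ear. Only `--spk` is quoted from the comment above; the `--exp` flag and the experiment name below are assumptions and may differ in your setup.)

```bash
# Hypothetical comparison run: same DS file, two different speaker embeds.
# "tgm_acou_b04" is a placeholder experiment name; point it at your checkpoint folder.
python scripts/infer.py acoustic your_project.ds --exp tgm_acou_b04 --spk leif_lush_e
python scripts/infer.py acoustic your_project.ds --exp tgm_acou_b04 --spk leif_petal_e
# If the two renders sound identical, the checkpoint itself mixes the timbres;
# if they differ, the problem lies in export or deployment (as turned out below).
```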

I personally train with 5 or 6 vocalists and ~9 timbres in total in every one of my experiments, and I have never encountered any issue with the differences between timbres. Some people in the community train larger datasets than mine, with multi-timbre singers in them, and they have no problem either.

I generally wasn't happy with the results I got with energy and breathiness before, so I decided not to train with those parameters.

According to my experience and that of other people in our community in China, the variance parameters do not cause any deterioration in quality, but rather improve stability and controllability. However, if you do not train them well, they can cause some problems, and there are some interesting findings in our recent research about the mutual influence between variance modules. These have been added to the documentation, and a minor release will also be published to notify users about it.
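
(For context, these are the switches involved on the acoustic side; in the config posted above they are currently off, and turning them on would also require a variance model that supplies the corresponding curves at inference time. A sketch, assuming the key names from that config:)

```yaml
# Acoustic config sketch: consume energy/breathiness as conditioning inputs.
use_energy_embed: true
use_breathiness_embed: true
energy_smooth_width: 0.12
breathiness_smooth_width: 0.12
```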

spicytigermeat commented 7 months ago

Thanks for the tip on debugging by inferring directly from the checkpoint. It turns out it's either an OpenUTAU issue or a deployment issue, because direct inference from the checkpoint via the command line actually gave me the proper output with separate timbres. I'll have to keep messing around with the OpenUTAU library file structure to figure out why it's getting the embeds confused, which is my guess. Do they have to be in the OpenUTAU configs in a certain order, as far as you're aware?

yqzhishen commented 7 months ago

When exporting to ONNX you should use `--export_spk spk1 --export_spk spk2 ...` to export all of your desired embeds; if this option is unset, the exporter exports all of the embeds. Then you should write them down in the OpenUTAU config as its wiki says. Yes, they should be in an ordered list, but in any order you like.

So you might have mixed up your embeds somehow, or it's probably just a personal mistake in the usage of OpenUTAU. You should first check your embeds to see if they are really different, then your configs, and then OpenUTAU itself (for example, use a clean install or reset all the preferences in case there is some misconfiguration in the expression settings).
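
(One quick way to check whether two exported embeds are actually different is to load and compare them. The sketch below assumes the exported .emb files are raw float32 vectors, which may not match your exporter's format; if they were written with np.save, use np.load instead. The paths are placeholders.)

```python
import numpy as np

# Placeholder paths for two exported speaker embeds.
a = np.fromfile("embeds/leif_lush_e.emb", dtype=np.float32)
b = np.fromfile("embeds/leif_petal_e.emb", dtype=np.float32)

# Identical embeds would explain identical timbres regardless of the OpenUTAU config.
print("max abs difference:", np.abs(a - b).max())
print("cosine similarity:", float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
```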

spicytigermeat commented 7 months ago

First of all, thank you so much for all of your dedication in helping me solve this issue; I've learned a ton!

Second of all, I discovered that the issue IS OpenUTAU. Apparently, if the embed files are not in the same directory as the character.yaml file, you have to specify where they are. I thought it pulled them from the list of speakers, so it was totally my misunderstanding.
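
(For anyone hitting the same wall: the fix was to point the voicebank config at the embed files explicitly. The fragment below is only a sketch from memory of the OpenUTAU DiffSinger wiki; the key names, the path style, and whether the extension is included should all be double-checked against that wiki.)

```yaml
# Hypothetical dsconfig.yaml fragment: embed files referenced by path relative to
# this file, instead of being assumed to sit next to character.yaml.
speakers:
  - embeds/leif_lush_e
  - embeds/leif_petal_e
  # ... one entry per exported speaker embed, in a fixed order ...
```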