shaojinding / Adversarial-Many-to-Many-VC

[InterSpeech 2020] "Improving the Speaker Identity of Non-Parallel Many-to-Many Voice Conversion with Adversarial Speaker Recognition" by Shaojin Ding, Guanlong Zhao, Ricardo Gutierrez-Osuna

Where to get VCTK alignments? #4

Closed FarisHijazi closed 3 years ago

FarisHijazi commented 4 years ago

Hello, I can't find the VCTK dataset alignments anywhere. I did find this method from deepvoice3, but I'm not even sure it's compatible.

Could you please upload the VCTK alignment files, or describe a way to generate them?

shaojinding commented 4 years ago

Hi FarisHijazi,

Thanks for your interest in this work! If I understand your question correctly, this work does not actually need the alignments during training or inference. Please let me know if you were referring to something else.

JeffC0628 commented 4 years ago

> Hi FarisHijazi,
>
> Thanks for your interest in this work! If I understand your question correctly, this work does not actually need the alignments during training or inference. Please let me know if you were referring to something else.

I found that the VCTK dataset does not match the procedure in synthesizer_preprocess_audio.py; its processing looks like it was written for the LibriTTS dataset.
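For reference, the two corpora are laid out differently, which is probably why a LibriTTS-oriented preprocessor trips on VCTK: LibriTTS nests utterances as speaker/chapter/*.wav, while VCTK keeps wav48/<speaker>/*.wav with transcripts under txt/<speaker>/. A minimal sketch of walking the stock VCTK layout (directory names per the standard VCTK release; the helper itself is illustrative, not this repo's code):

```python
from pathlib import Path

def vctk_utterances(root: Path):
    """Yield (speaker, wav_path, transcript) for the stock VCTK layout:
    root/wav48/<speaker>/<utt>.wav, transcripts in root/txt/<speaker>/."""
    for spk_dir in sorted((root / "wav48").iterdir()):
        if not spk_dir.is_dir():
            continue
        for wav_path in sorted(spk_dir.glob("*.wav")):
            txt_path = root / "txt" / spk_dir.name / (wav_path.stem + ".txt")
            text = txt_path.read_text().strip() if txt_path.exists() else ""
            yield spk_dir.name, wav_path, text
```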

shaojinding commented 4 years ago

Oh, I see. I pushed the wrong version to the repo; it has been fixed in https://github.com/shaojinding/Adversarial-Many-to-Many-VC/commit/37da0fc7b9ce585cd578bcde0cc61567a104a67d

Let me know if it works. Thanks

JeffC0628 commented 3 years ago

> Oh, I see. I pushed the wrong version to the repo; it has been fixed in https://github.com/shaojinding/Adversarial-Many-to-Many-VC/commit/37da0fc7b9ce585cd578bcde0cc61567a104a67d
>
> Let me know if it works. Thanks

Thanks for the reply. There is still a problem: in synthesizer/preprocess.py, line 159, the call to _process_utterance(wav, out_dir, wav_cat_fname, skip_existing, hparams) is missing the text: str parameter.
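For illustration, here is the arity mismatch in a sketch; the signature follows the SV2TTS-style preprocessing this codebase appears to derive from, and the parameter names are assumptions:

```python
from pathlib import Path
import numpy as np

# Assumed signature: the second positional parameter is the transcript.
def _process_utterance(wav: np.ndarray, text: str, out_dir: Path,
                       basename: str, skip_existing: bool, hparams):
    ...  # write mel/audio features to out_dir, return a metadata tuple

# The failing call passes only five arguments, so `text` is missing and
# every later argument shifts one position left. Supplying the transcript
# (or "" if transcripts are unused here) restores the expected arity:
# _process_utterance(wav, text, out_dir, wav_cat_fname, skip_existing, hparams)
```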

FarisHijazi commented 3 years ago

I see that you don't need the alignment times, but the VCTK preprocessing code does look for alignments. I'll fix that with a try/except and submit a PR (FYI, there were many bugs in the LibriSpeech preprocessing; it seems you preferred VCTK). I fixed most of them, so expect a PR from me soon. I'll close the issue once I verify that it works.
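Roughly the kind of guard I have in mind (a sketch, not the final PR; the load_alignments helper and its fallback behavior are placeholders):

```python
def load_alignments(alignment_path):
    """Return parsed alignment rows, or None when the file is absent (VCTK)."""
    try:
        with open(alignment_path, encoding="utf-8") as f:
            return [line.split() for line in f]
    except FileNotFoundError:
        # VCTK ships no alignment files; callers would treat None as
        # "fall back to silence-based splitting of the utterance".
        return None
```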

JeffC0628 commented 3 years ago

> I see that you don't need the alignment times, but the VCTK preprocessing code does look for alignments. I'll fix that with a try/except and submit a PR (FYI, there were many bugs in the LibriSpeech preprocessing; it seems you preferred VCTK). I fixed most of them, so expect a PR from me soon. I'll close the issue once I verify that it works.

Actually, I have already set the VCTK text alignments to None, and the data preprocessing completes, but when I run synthesizer_train.py this ValueError appears:

Traceback (most recent call last):
  File "synthesizer_train.py", line 56, in <module>
    tacotron_train(args, log_dir, hparams)
  File "adversarial-many-to-many-vc/synthesizer/train.py", line 408, in tacotron_train
    return train(log_dir, args, hparams)
  File "adversarial-many-to-many-vc/synthesizer/train.py", line 159, in train
    model, stats = model_train_mode(args, feeder, hparams, global_step)
  File "adversarial-many-to-many-vc/synthesizer/train.py", line 98, in model_train_mode
    model.add_optimizer(global_step)
  File "adversarial-many-to-many-vc/synthesizer/models/tacotron.py", line 529, in add_optimizer
    expanded_g = tf.expand_dims(g, 0)
  File "/anaconda3/envs/ppg-vc/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/anaconda3/envs/ppg-vc/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/anaconda3/envs/ppg-vc/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 148, in expand_dims
    return expand_dims_v2(input, axis, name)
  File "/anaconda3/envs/ppg-vc/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/anaconda3/envs/ppg-vc/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 197, in expand_dims_v2
    return gen_array_ops.expand_dims(input, axis, name)
  File "/anaconda3/envs/ppg-vc/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 2459, in expand_dims
    "ExpandDims", input=input, dim=axis, name=name)
  File "/anaconda3/envs/ppg-vc/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 545, in _apply_op_helper
    (input_name, err))
ValueError: Tried to convert 'input' to a tensor and failed. Error: None values not supported.

And here is my config:

-----------------------------------------------------------------
Starting new vc_adversarial training run
-----------------------------------------------------------------
[2020-12-04 09:48:35.874]  Checkpoint path: synthesizer/saved_models/logs-vc_adversarial/taco_pretrained/tacotron_model.ckpt
[2020-12-04 09:48:35.874]  Loading training data from: data/SV2TTS/synthesizer_train/train.txt
[2020-12-04 09:48:35.874]  Using model: Tacotron
[2020-12-04 09:48:35.875]  Hyperparameters:
  allow_clipping_in_normalization: True
  attention_dim: 128
  attention_filters: 32
  attention_kernel: (31,)
  cbhg_conv_channels: 128
  cbhg_highway_units: 128
  cbhg_highwaynet_layers: 4
  cbhg_kernels: 8
  cbhg_pool_size: 2
  cbhg_projection: 256
  cbhg_projection_kernel_size: 3
  cbhg_rnn_units: 128
  cleaners: english_cleaners
  clip_for_wavenet: True
  clip_mels_length: True
  cross_entropy_pos_weight: 20
  cumulative_weights: True
  decoder_layers: 2
  decoder_lstm_units: 1024
  embedding_dim: 512
  enc_conv_channels: 512
  enc_conv_kernel_size: (5,)
  enc_conv_num_layers: 3
  enc_prenet_layers: [128, 256]
  encoder_lstm_units: 256
  fmax: 7600
  fmin: 55
  frame_shift_ms: None
  griffin_lim_iters: 60
  hop_size: 200
  if_use_speaker_classifier: False
  is_encoder_lstm_2layers: False
  is_encoder_lstm_pyramid: True
  mask_decoder: False
  mask_encoder: True
  max_abs_value: 4.0
  max_iters: 2000
  max_mel_frames: 900
  min_level_db: -100
  n_fft: 800
  n_speakers: 105
  natural_eval: False
  normalize_for_wavenet: True
  num_mels: 80
  num_ppgs: 40
  outputs_per_step: 1
  postnet_channels: 512
  postnet_kernel_size: (5,)
  postnet_num_layers: 5
  power: 1.5
  predict_linear: False
  preemphasis: 0.97
  preemphasize: True
  prenet_layers: [256, 256]
  ref_level_db: 20
  rescale: False
  rescaling_max: 0.9
  sample_rate: 16000
  signal_normalization: True
  silence_min_duration_split: 0.4
  silence_threshold: 2
  smoothing: False
  speaker_embedding_size: 256
  split_on_cpu: True
  stop_at_any: True
  symmetric_mels: True
  tacotron_adam_beta1: 0.9
  tacotron_adam_beta2: 0.999
  tacotron_adam_epsilon: 1e-06
  tacotron_batch_size: 36
  tacotron_clip_gradients: True
  tacotron_data_random_state: 1234
  tacotron_decay_learning_rate: True
  tacotron_decay_rate: 0.5
  tacotron_decay_steps: 50000
  tacotron_dropout_rate: 0.5
  tacotron_final_learning_rate: 1e-05
  tacotron_gpu_start_idx: 3
  tacotron_initial_learning_rate: 0.001
  tacotron_num_gpus: 1
  tacotron_random_seed: 5339
  tacotron_reg_weight: 1e-07
  tacotron_scale_regularization: False
  tacotron_start_decay: 50000
  tacotron_swap_with_cpu: False
  tacotron_synthesis_batch_size: 128
  tacotron_teacher_forcing_decay_alpha: 0.0
  tacotron_teacher_forcing_decay_steps: 280000
  tacotron_teacher_forcing_final_ratio: 0.0
  tacotron_teacher_forcing_init_ratio: 1.0
  tacotron_teacher_forcing_mode: constant
  tacotron_teacher_forcing_ratio: 1.0
  tacotron_teacher_forcing_start_decay: 10000
  tacotron_test_batches: None
  tacotron_test_size: 0.05
  tacotron_zoneout_rate: 0.1
  train_with_GTA: False
  trim_fft_size: 512
  trim_hop_size: 128
  trim_top_db: 23
  use_full_ppg: False
  use_lws: False
  utterance_min_duration: 1.6
  win_size: 800
[2020-12-04 09:48:36.039]  Loaded metadata for 32881 examples (25.81 hours)
[2020-12-04 09:48:46.382]  initialisation done /gpu:3
[2020-12-04 09:48:46.382]  Initialized Tacotron model. Dimensions (? = dynamic shape): 
[2020-12-04 09:48:46.382]    Train mode:               True
[2020-12-04 09:48:46.382]    Eval mode:                False
[2020-12-04 09:48:46.382]    GTA mode:                 False
[2020-12-04 09:48:46.382]    Synthesis mode:           False
[2020-12-04 09:48:46.382]    Input:                    (?, ?, 40)
[2020-12-04 09:48:46.382]    device:                   3
[2020-12-04 09:48:46.382]    embedding:                (?, ?, 40)
[2020-12-04 09:48:46.382]    enc conv out:             (?, ?, 512)
[2020-12-04 09:48:46.382]    adversial classifier out: ?
[2020-12-04 09:48:46.382]    encoder out (cond):       (?, ?, 768)
[2020-12-04 09:48:46.382]    decoder out:              (?, ?, 80)
[2020-12-04 09:48:46.382]    residual out:             (?, ?, 512)
[2020-12-04 09:48:46.382]    projected residual out:   (?, ?, 80)
[2020-12-04 09:48:46.382]    mel out:                  (?, ?, 80)
[2020-12-04 09:48:46.382]    <stop_token> out:         (?, ?)
[2020-12-04 09:48:46.384]    Tacotron Parameters       29.271 Million.

So I guess something must be wrong in the code, maybe in the data preprocessing. I'm still debugging it. Thanks for your help.
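One pattern worth checking: ValueError: None values not supported at tf.expand_dims(g, 0) typically appears when compute_gradients returns None for a variable that has no path to the loss (plausible here, since if_use_speaker_classifier is False while the classifier's variables may still be created). A minimal TF1-style sketch of the usual guard, not this repo's exact add_optimizer:

```python
import tensorflow as tf  # TF1-style graph API, matching the traceback

def clipped_train_op(loss, optimizer, global_step):
    grads_and_vars = optimizer.compute_gradients(loss)
    # A None gradient (variable not connected to the loss) is exactly
    # what makes tf.expand_dims(g, 0) fail during gradient clipping.
    grads_and_vars = [(g, v) for g, v in grads_and_vars if g is not None]
    grads, variables = zip(*grads_and_vars)
    clipped, _ = tf.clip_by_global_norm(grads, 1.0)
    return optimizer.apply_gradients(zip(clipped, variables),
                                     global_step=global_step)
```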