kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch
https://kan-bayashi.github.io/ParallelWaveGAN/
MIT License

[Question] Best config for multispeaker PWGAN #192

Closed · george-roussos closed this issue 4 years ago

george-roussos commented 4 years ago

Hi,

I have been trying to train different models on LibriTTS and some internal speakers, but I think the results leave much to be desired. While I get good results in the single-speaker case, the same is not true for multi-speaker. My understanding was that multi-speaker training would yield much better results, since the range of waveforms and sounds is much broader and there are more than 200 hours of speech. What happens instead is that the output either has static or sounds muffled. I am talking about TTS synthesis, by the way. Of course I am not after WaveNet quality, but the single-speaker tests sound much better.

Taco2 has been trained for a long time, so the spectrograms should be clean on that front.

My config is below (I am guessing something in it may be wrong); it is the same one I used for my single-speaker training, which yielded good results:

format: "hdf5"
audio:
  clip_norm: true
  do_trim_silence: false
  frame_length_ms: 50
  frame_shift_ms: 12.5
  max_norm: 4
  hop_length: 275
  win_length: 1100
  mel_fmax: 8000.0
  mel_fmin: 0.0
  min_level_db: -100
  num_freq: 1025
  num_mels: 80
  preemphasis: 0.98
  ref_level_db: 20
  sample_rate: 22050
  signal_norm: true
  sound_norm: false
  symmetric_norm: true
  trim_db: 20

generator_params:
    in_channels: 1        # Number of input channels.
    out_channels: 1       # Number of output channels.
    kernel_size: 3        # Kernel size of dilated convolution.
    layers: 30            # Number of residual block layers.
    stacks: 3             # Number of stacks i.e., dilation cycles.
    residual_channels: 64 # Number of channels in residual conv.
    gate_channels: 128    # Number of channels in gated conv.
    skip_channels: 64     # Number of channels in skip conv.
    aux_channels: 80      # Number of channels for auxiliary feature conv.
                          # Must be the same as num_mels.
    aux_context_window: 2 # Context window size for auxiliary feature.
                          # If set to 2, previous 2 and future 2 frames will be considered.
    dropout: 0.0          # Dropout rate. 0.0 means no dropout applied.
    use_weight_norm: true # Whether to use weight norm.
                          # If set to true, it will be applied to all of the conv layers.
    upsample_net: "ConvInUpsampleNetwork" # Upsampling network architecture.
    upsample_params:                      # Upsampling network parameters.
      upsample_scales:
      - 5
      - 5
      - 11

discriminator_params:
    in_channels: 1        # Number of input channels.
    out_channels: 1       # Number of output channels.
    kernel_size: 3        # Kernel size of dilated convolution.
    layers: 10            # Number of conv layers.
    conv_channels: 64     # Number of channels in conv layers.
    bias: true            # Whether to use bias parameter in conv.
    use_weight_norm: true # Whether to use weight norm.
                          # If set to true, it will be applied to all of the conv layers.
    nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv.
    nonlinear_activation_params:      # Nonlinear function parameters
        negative_slope: 0.2           # Alpha in LeakyReLU.

stft_loss_params:
    fft_sizes: [1024, 2048, 512]  # List of FFT size for STFT-based loss.
    hop_sizes: [120, 240, 50]     # List of hop size for STFT-based loss
    win_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
    window: "hann_window"         # Window function for STFT-based loss

lambda_adv: 4.0  # Loss balancing coefficient.

batch_size: 8              # Batch size.
batch_max_steps: 26125     # Length of each audio clip in the batch. Make sure it is divisible by hop_size.
pin_memory: true           # Whether to pin memory in Pytorch DataLoader.
num_workers: 8             # Number of workers in Pytorch DataLoader.
remove_short_samples: true # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: false         # Whether to cache data in the dataset. If true, it requires CPU memory.

generator_optimizer_params:
    lr: 0.0001             # Generator's learning rate.
    eps: 1.0e-6            # Generator's epsilon.
    weight_decay: 0.0      # Generator's weight decay coefficient.
generator_scheduler_params:
    step_size: 200000      # Generator's scheduler step size.
    gamma: 0.5             # Generator's scheduler gamma.
                           # At each step size, lr will be multiplied by this parameter.
generator_grad_norm: 10    # Generator's gradient norm.
discriminator_optimizer_params:
    lr: 0.00005            # Discriminator's learning rate.
    eps: 1.0e-6            # Discriminator's epsilon.
    weight_decay: 0.0      # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
    step_size: 200000      # Discriminator's scheduler step size.
    gamma: 0.5             # Discriminator's scheduler gamma.
                           # At each step size, lr will be multiplied by this parameter.
discriminator_grad_norm: 1 # Discriminator's gradient norm.

discriminator_train_start_steps: 100000 # Number of steps to start to train discriminator.
train_max_steps: 1000000                 # Number of training steps.
save_interval_steps: 5000               # Interval steps to save checkpoint.
eval_interval_steps: 1000               # Interval steps to evaluate the network.
log_interval_steps: 100                 # Interval steps to record the training log.

num_save_intermediate_results: 4  # Number of results to be saved as intermediate results.

I have downsampled the dataset and changed the hop and win sizes to accommodate my TTS settings.
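As a sanity check on the numbers above (a quick script of my own, not something from the repo): the product of upsample_scales has to equal hop_length, and batch_max_steps has to be divisible by hop_length.

import math

# Consistency check for the config above (my own quick script, not part of the repo).
sample_rate = 22050
hop_length = 275
win_length = 1100
upsample_scales = [5, 5, 11]
batch_max_steps = 26125

assert math.prod(upsample_scales) == hop_length  # 5 * 5 * 11 = 275
assert batch_max_steps % hop_length == 0         # 26125 / 275 = 95 frames per training clip
print(f"hop = {1000 * hop_length / sample_rate:.2f} ms, "
      f"win = {1000 * win_length / sample_rate:.2f} ms")
# -> hop = 12.47 ms, win = 49.89 ms (close to, but not exactly, the 12.5 / 50 ms
#    frame_shift_ms / frame_length_ms values in the audio block)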

[Two screenshots attached (2020-07-22, 15:02), not reproduced here.]
kan-bayashi commented 4 years ago

My understanding was that multi-speaker training would yield much better results, since the range of waveforms and sounds is much broader and there are more than 200 hours of speech.

In my experiments, a model trained on a long single-speaker dataset (say, around 24 hours) is better than one trained on multi-speaker data (e.g., several minutes x various speakers). So I think the important points are the length of each speaker's speech and its quality. If you add a bad-quality speaker, it may cause quality degradation.

george-roussos commented 4 years ago

This is exactly what I thought at first, and I would wager that some LibriTTS recordings have a lot of noise. My single speaker, admittedly, had no background noise. But then I thought that noise is introduced during GAN training anyway, no? So I figured that adding noisy data to the set might make the model more robust. I got exactly the same noise as in your pretrained LibriTTS model, a "zzzzzz" kind of thing. But I checked your pretrained Multi-band MelGAN model on VCTK and it sounds much, much better.

I also think you have a point about the duration of speech per speaker, although that makes it a harder problem to tackle. Have you done any more experiments with multi-speaker data? I would be willing to train one more model and carefully curate the data, but I am not really sure how many speakers are needed for the model to be able to generalize to other speakers.
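For the curation step, one rough approach is to measure how much audio each speaker actually has and drop speakers below some threshold. A sketch of what I have in mind (the paths, the threshold, and the LibriTTS-style <speaker>/<chapter>/*.wav layout are assumptions):

from collections import defaultdict
from pathlib import Path

import soundfile as sf

# Hypothetical paths and threshold; layout assumed to be <root>/<speaker>/<chapter>/*.wav.
corpus_root = Path("data/LibriTTS/train-clean-360")
min_hours = 1.0  # arbitrary cut-off, just for illustration

hours = defaultdict(float)
for wav_path in corpus_root.glob("**/*.wav"):
    speaker = wav_path.relative_to(corpus_root).parts[0]
    info = sf.info(str(wav_path))
    hours[speaker] += info.frames / info.samplerate / 3600.0

kept = [spk for spk, total in hours.items() if total >= min_hours]
print(f"{len(kept)} of {len(hours)} speakers have at least {min_hours} h of audio")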

kan-bayashi commented 4 years ago

But then I thought that noise is introduced during GAN training anyway, no?

What is the noise?

But I checked your pretrained Multi-band MelGAN model on VCTK and it sounds much, much better.

I think VCTK has better recording quality than LibriTTS. Did you compare my LibriTTS and VCTK samples?

but I am not really sure how many speakers are needed for the model to be able to generalize to other speakers.

This is a very difficult question. I have no clear answer.

george-roussos commented 4 years ago

I do not have samples at hand right now, unfortunately, but it was something like a static noise that sounded like "zzzz". Kind of metallic. If you listen to your LibriTTS and VCTK generated samples, the VCTK ones sound much clearer and very, very close to the ground truth. So I think you are right that it depends on the dataset's sound quality.

Now I read this paper here, and they report generalizing to unseen speakers on 60 hours of data (10 hours each for 6 speakers). I will try that and report back here. If it does not work, I think I will also try training a PWGAN on VCTK.
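If I do go the VCTK route, I will need to resample it first, since VCTK is recorded at 48 kHz and my config uses 22050 Hz. Roughly something like this (paths are placeholders):

from pathlib import Path

import librosa
import soundfile as sf

# Placeholder paths; VCTK ships its wavs at 48 kHz under wav48/.
src_dir = Path("VCTK-Corpus/wav48")
dst_dir = Path("VCTK-Corpus/wav22")
target_sr = 22050  # must match sample_rate in the TTS / vocoder configs

for src in src_dir.glob("**/*.wav"):
    wav, _ = librosa.load(str(src), sr=target_sr)  # load and resample in one step
    dst = dst_dir / src.relative_to(src_dir)
    dst.parent.mkdir(parents=True, exist_ok=True)
    sf.write(str(dst), wav, target_sr)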