kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with PyTorch
https://kan-bayashi.github.io/ParallelWaveGAN/
MIT License
1.55k stars, 340 forks

StyleMelGAN tuning #282

Closed kan-bayashi closed 3 years ago

kan-bayashi commented 3 years ago

2021/08/06

Learning rate scheduling may need to be investigated.

kan-bayashi commented 3 years ago

Screenshot 2021-08-06 14 22 45

kan-bayashi commented 3 years ago

2021/08/07

Screenshot 2021-08-07 8 25 56

The hinge loss version is worse than the MSE version at 200k iters. The samples after 200k iters with MSE sound very good but are a bit noisier than HiFiGAN. There is a pop noise at the end of the audio. Maybe I can remove it by handling the padding during inference.
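
For reference, the two discriminator objectives compared above are commonly written as follows. This is a minimal sketch in plain Python, assuming the standard LSGAN (MSE) and hinge formulations; the function names are hypothetical, and the repository's actual loss code may differ in reduction or weighting.

```python
def mse_discriminator_loss(real_scores, fake_scores):
    """LSGAN-style loss: push real scores toward 1 and fake scores toward 0."""
    real_loss = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_loss = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real_loss, fake_loss


def hinge_discriminator_loss(real_scores, fake_scores):
    """Hinge loss: penalize real scores below +1 and fake scores above -1."""
    real_loss = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_loss = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_loss, fake_loss


# Toy discriminator outputs (made-up numbers, for illustration only).
real, fake = [0.9, 1.2, 0.4], [-0.8, 0.1, -1.5]
print("mse  :", mse_discriminator_loss(real, fake))
print("hinge:", hinge_discriminator_loss(real, fake))
```

The hinge loss stops penalizing a sample once its score is past the margin, while the MSE loss keeps pulling every score toward its target, so the two can behave quite differently over a long run.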

sample_audios_style.zip

kan-bayashi commented 3 years ago

The pop noise has been removed (#283).

kan-bayashi commented 3 years ago

2021/08/09

Screenshot 2021-08-09 8 42 05

The hinge loss version has gradually improved but is still worse than MSE at 650k. v1 and v2 sound almost the same, but v1 sounds slightly better in terms of SNR. (The difference in stft_magnitude_loss does not show up in perceptual quality.)

20210809_style_melgan_samples.zip

kan-bayashi commented 3 years ago

2021/08/19

I tested on various datasets, and all of them are working well.

Screenshot 2021-08-19 14 27 34

My impression is that HiFiGAN is slightly better than StyleMelGAN, especially when the data includes slight white noise or reverberation. But thanks to the use of noise, StyleMelGAN might be better when combined with text2mel without fine-tuning.

azraelkuan commented 3 years ago

Some questions:

  1. It seems the real loss, fake loss, and adv loss do not change much during training. Is this different from PWG or MelGAN?
  2. Did you try combining the MelGAN G (or another G) with the StyleMelGAN D?

kan-bayashi commented 3 years ago

It seems the real loss, fake loss, and adv loss do not change much during training. Is this different from PWG or MelGAN?

In the PWG case, the discriminator loss also does not change and stays at almost the same value, so StyleMelGAN is similar to PWG.

Did you try combining the MelGAN G (or another G) with the StyleMelGAN D?

I tried PWG + HiFiGAN D, which works well, but I have not yet tried the combination with StyleMelGAN.

azraelkuan commented 3 years ago

Yeah, I noticed that. The real loss and fake loss in the MelGAN D should change from 0.25 to 0.1 (LSGAN), so I am very confused about the loss. Maybe the StyleMelGAN G is too good?
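
As a sanity check on the 0.25 figure, assume the MSE (LSGAN) discriminator loss with target 1 for real and 0 for generated samples. A discriminator that cannot tell the two apart outputs about 0.5 for everything, which gives:

```latex
L_{\text{real}} = \bigl(D(x) - 1\bigr)^2 = (0.5 - 1)^2 = 0.25,
\qquad
L_{\text{fake}} = D\bigl(G(z)\bigr)^2 = 0.5^2 = 0.25
```

Under that assumption, losses that stay near 0.25 mean the discriminator is close to chance level, while losses falling toward 0.1 mean it is separating real from generated audio more confidently.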

Also, I noticed a problem in the StyleMelGAN D: https://github.com/kan-bayashi/ParallelWaveGAN/blob/6a5f2f9e2f39421f385f88b9123260b8a46d87c0/parallel_wavegan/models/style_melgan.py#L332

For the different repeats, you can extract the chunks first, concatenate them along the batch dimension, and then feed them to the discriminator in a single pass. This can speed up training.
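
The control flow of that suggestion can be sketched in plain Python. Everything here is hypothetical: `sample_chunks` and `CountingDiscriminator` are illustrative stand-ins, not code from the repository, and lists replace tensors so the sketch runs without PyTorch.

```python
import random


def sample_chunks(batch, window, repeats, rng):
    """Cut `repeats` random windows of length `window` from every signal in `batch`."""
    chunks = []
    for _ in range(repeats):
        for signal in batch:
            start = rng.randrange(len(signal) - window + 1)
            chunks.append(signal[start:start + window])
    return chunks  # repeats * len(batch) chunks, i.e. one enlarged batch


class CountingDiscriminator:
    """Stand-in for a discriminator; records how many forward passes are made."""

    def __init__(self):
        self.calls = 0

    def __call__(self, chunks):
        self.calls += 1
        return [sum(c) / len(c) for c in chunks]  # dummy per-chunk score


rng = random.Random(0)
batch = [[float(i + j) for i in range(100)] for j in range(4)]  # 4 toy signals
disc = CountingDiscriminator()

# One forward pass over all repeats instead of one pass per repeat.
scores = disc(sample_chunks(batch, window=16, repeats=3, rng=rng))
print(len(scores), disc.calls)
```

In a PyTorch version the list of chunks would presumably become one tensor via `torch.cat(chunks, dim=0)`, so the discriminator sees a batch that is `repeats` times larger but runs only once.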

kan-bayashi commented 3 years ago

That is a nice idea. I will consider it.

wblgers commented 2 years ago

It seems the real loss, fake loss, and adv loss do not change much during training. Is this different from PWG or MelGAN?

In the PWG case, the discriminator loss also does not change and stays at almost the same value, so StyleMelGAN is similar to PWG.

Did you try combining the MelGAN G (or another G) with the StyleMelGAN D?

I tried PWG + HiFiGAN D, which works well, but I have not yet tried the combination with StyleMelGAN.

Hi, can you share the PWG + HiFiGAN D config?

skol101 commented 2 years ago

@kan-bayashi would you share that config?

kan-bayashi commented 2 years ago

Just an example.

Edited: fixed batch size and batch max steps.

###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
sampling_rate: 24000     # Sampling rate.
fft_size: 2048           # FFT size.
hop_size: 300            # Hop size.
win_length: 1200         # Window length.
                         # If set to null, it will be the same as fft_size.
window: "hann"           # Window function.
num_mels: 80             # Number of mel basis.
fmin: 80                 # Minimum freq in mel basis calculation.
fmax: 7600               # Maximum frequency in mel basis calculation.
global_gain_scale: 1.0   # Will be multiplied to all of waveform.
trim_silence: false      # Whether to trim the start and end of silence.
trim_threshold_in_db: 60 # Need to tune carefully if the recording is not good.
trim_frame_size: 2048    # Frame size in trimming.
trim_hop_size: 512       # Hop size in trimming.
format: "hdf5"           # Feature file format. "npy" or "hdf5" is supported.

###########################################################
#         GENERATOR NETWORK ARCHITECTURE SETTING          #
###########################################################
generator_params:
    in_channels: 1         # Number of input channels.
    out_channels: 1        # Number of output channels.
    kernel_size: 5         # Kernel size of dilated convolution.
    layers: 30             # Number of residual block layers.
    stacks: 3              # Number of stacks i.e., dilation cycles.
    residual_channels: 64  # Number of channels in residual conv.
    gate_channels: 128     # Number of channels in gated conv.
    skip_channels: 64      # Number of channels in skip conv.
    aux_channels: 80       # Number of channels for auxiliary feature conv.
                           # Must be the same as num_mels.
    aux_context_window: 2  # Context window size for auxiliary feature.
                           # If set to 2, previous 2 and future 2 frames will be considered.
                           # But if use causal conv, only previous 2 frame will be considered.
    dropout: 0.0           # Dropout rate. 0.0 means no dropout applied.
    use_weight_norm: true  # Whether to use weight norm.
                           # If set to true, it will be applied to all of the conv layers.
    use_causal_conv: false # Whether to use causal convolution.
    upsample_net: "ConvInUpsampleNetwork" # Upsampling network architecture.
    upsample_params:                      # Upsampling network parameters.
        upsample_scales: [4, 5, 3, 5]     # Upsampling scales. Product of these must be the same as hop size.

###########################################################
#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
###########################################################
discriminator_type: HiFiGANMultiScaleMultiPeriodDiscriminator
discriminator_params:
    scales: 3                              # Number of multi-scale discriminator.
    scale_downsample_pooling: "AvgPool1d"  # Pooling operation for scale discriminator.
    scale_downsample_pooling_params:
        kernel_size: 4                     # Pooling kernel size.
        stride: 2                          # Pooling stride.
        padding: 2                         # Padding size.
    scale_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [15, 41, 5, 3]       # List of kernel sizes.
        channels: 128                      # Initial number of channels.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        max_groups: 16                     # Maximum number of groups in downsampling conv layers.
        bias: true
        downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
        nonlinear_activation: "LeakyReLU"  # Nonlinear activation.
        nonlinear_activation_params:
            negative_slope: 0.1
    follow_official_norm: true             # Whether to follow the official norm setting.
    periods: [2, 3, 5, 7, 11]              # List of period for multi-period discriminator.
    period_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [5, 3]               # List of kernel sizes.
        channels: 32                       # Initial number of channels.
        downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        bias: true                         # Whether to use bias parameter in conv layer.
        nonlinear_activation: "LeakyReLU"  # Nonlinear activation.
        nonlinear_activation_params:       # Nonlinear activation parameters.
            negative_slope: 0.1
        use_weight_norm: true              # Whether to apply weight normalization.
        use_spectral_norm: false           # Whether to apply spectral normalization.

###########################################################
#                   STFT LOSS SETTING                     #
###########################################################
use_stft_loss: false                 # Whether to use multi-resolution STFT loss.
use_mel_loss: true                   # Whether to use Mel-spectrogram loss.
mel_loss_params:
    fs: 24000
    fft_size: 2048
    hop_size: 300
    win_length: 1200
    window: "hann"
    num_mels: 80
    fmin: 0
    fmax: 11025
    log_base: null
generator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
use_feat_match_loss: true
feat_match_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
    average_by_layers: false         # Whether to average loss by #layers in each discriminator.
    include_final_outputs: false     # Whether to include final outputs in feat match loss calculation.

###########################################################
#               ADVERSARIAL LOSS SETTING                  #
###########################################################
lambda_aux: 45.0       # Loss balancing coefficient for auxiliary (mel-spectrogram) loss.
lambda_adv: 1.0        # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.

###########################################################
#                  DATA LOADER SETTING                    #
###########################################################
batch_size: 16              # Batch size.
batch_max_steps: 8400       # Length of each audio in batch. Make sure it is divisible by hop_size.
pin_memory: true            # Whether to pin memory in PyTorch DataLoader.
num_workers: 2              # Number of workers in PyTorch DataLoader.
remove_short_samples: false # Whether to remove samples the length of which are less than batch_max_steps.
allow_cache: true           # Whether to allow cache in dataset. If true, it requires cpu memory.

###########################################################
#             OPTIMIZER & SCHEDULER SETTING               #
###########################################################
generator_optimizer_type: Adam
generator_optimizer_params:
    lr: 2.0e-4
    betas: [0.5, 0.9]
    weight_decay: 0.0
generator_scheduler_type: MultiStepLR
generator_scheduler_params:
    gamma: 0.5
    milestones:
        - 200000
        - 400000
        - 600000
        - 800000
generator_grad_norm: -1
discriminator_optimizer_type: Adam
discriminator_optimizer_params:
    lr: 2.0e-4
    betas: [0.5, 0.9]
    weight_decay: 0.0
discriminator_scheduler_type: MultiStepLR
discriminator_scheduler_params:
    gamma: 0.5
    milestones:
        - 200000
        - 400000
        - 600000
        - 800000
discriminator_grad_norm: -1

###########################################################
#                    INTERVAL SETTING                     #
###########################################################
generator_train_start_steps: 1     # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 1500000           # Number of training steps.
save_interval_steps: 100000        # Interval steps to save checkpoint.
eval_interval_steps: 5000          # Interval steps to evaluate the network.
log_interval_steps: 100            # Interval steps to record the training log.

###########################################################
#                     OTHER SETTING                       #
###########################################################
num_save_intermediate_results: 4  # Number of results to be saved as intermediate results.
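
Two relationships in the config above can be checked with a few lines of plain Python: the product of `upsample_scales` against `hop_size`, and the learning rate implied by the milestones. `lr_at` is a hypothetical helper that mimics a multi-step decay with the listed `gamma` and `milestones`; it is an assumption about how the schedule behaves, not the library's implementation.

```python
import math

hop_size = 300
upsample_scales = [4, 5, 3, 5]
base_lr = 2.0e-4
gamma = 0.5
milestones = [200000, 400000, 600000, 800000]


def lr_at(step):
    """Learning rate after `step` updates under a multi-step decay schedule."""
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * gamma ** passed


# The upsampling network must expand one mel frame into hop_size samples.
assert math.prod(upsample_scales) == hop_size

for step in (0, 200000, 500000, 1500000):
    print(step, lr_at(step))
```

With four milestones and `gamma: 0.5`, the rate is halved four times, so the last 700k of the 1.5M steps run at one sixteenth of the initial learning rate.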

skol101 commented 2 years ago

batch_size is 4? Why so?

kan-bayashi commented 2 years ago

Sorry, this config was for ngpus=4, so it was set to 4. You can use 16.

skol101 commented 2 years ago

Dividing batch_max_steps by hop_size does not give a whole number: the result is 27.30(6). Is this OK?

kan-bayashi commented 2 years ago

Sorry, my mistake. It is better to use 8100 or 8400.
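
The divisibility requirement is easy to verify before training. In this sketch `frames_per_batch` is a hypothetical helper, and 8192 is only inferred from the quotient quoted above (27.30(6) × 300); the value in the config before it was edited is not shown in the thread.

```python
hop_size = 300


def frames_per_batch(batch_max_steps, hop_size):
    """Return the number of mel frames per batch item, or None if it is not a whole number."""
    frames, remainder = divmod(batch_max_steps, hop_size)
    return frames if remainder == 0 else None


for steps in (8192, 8100, 8400):
    print(steps, frames_per_batch(steps, hop_size))
```

Both suggested values divide evenly: 8100 corresponds to 27 frames and 8400 to 28 frames.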

skol101 commented 2 years ago

It seems the real loss, fake loss, and adv loss do not change much during training. Is this different from PWG or MelGAN?

In the PWG case, the discriminator loss also does not change and stays at almost the same value, so StyleMelGAN is similar to PWG.

Did you try combining the MelGAN G (or another G) with the StyleMelGAN D?

I tried PWG + HiFiGAN D, which works well, but I have not yet tried the combination with StyleMelGAN.

I'm trying the StyleMelGAN G with the HiFiGAN D (all with default params, other settings copied from the PWG + HiFiGAN config), but I get the following error at step 1 of training:

 The size of tensor a (28) must match the size of tensor b (80) at non-singleton dimension 2

Here's the config:

###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
sampling_rate: 24000     # Sampling rate.
fft_size: 2048           # FFT size.
hop_size: 300            # Hop size.
win_length: 1200         # Window length.
                         # If set to null, it will be the same as fft_size.
window: "hann"           # Window function.
num_mels: 80             # Number of mel basis.
fmin: 80                 # Minimum freq in mel basis calculation.
fmax: 7600               # Maximum frequency in mel basis calculation.
global_gain_scale: 1.0   # Will be multiplied to all of waveform.
trim_silence: false      # Whether to trim the start and end of silence.
trim_threshold_in_db: 60 # Need to tune carefully if the recording is not good.
trim_frame_size: 1024    # Frame size in trimming.
trim_hop_size: 256       # Hop size in trimming.
format: "hdf5"           # Feature file format. "npy" or "hdf5" is supported.

###########################################################
#         GENERATOR NETWORK ARCHITECTURE SETTING          #
###########################################################
generator_type: "StyleMelGANGenerator" # Generator type.
generator_params:
    in_channels: 128
    aux_channels: 80
    channels: 64
    out_channels: 1
    kernel_size: 9
    dilation: 2
    bias: True
    noise_upsample_scales: [10, 2, 2, 2]
    noise_upsample_activation: "LeakyReLU"
    noise_upsample_activation_params:
        negative_slope: 0.2
    upsample_scales: [5, 1, 5, 1, 3, 1, 2, 2, 1]
    upsample_mode: "nearest"
    gated_function: "softmax"
    use_weight_norm: True

###########################################################
#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
###########################################################
discriminator_type: HiFiGANMultiScaleMultiPeriodDiscriminator
discriminator_params:
    scales: 3                              # Number of multi-scale discriminator.
    scale_downsample_pooling: "AvgPool1d"  # Pooling operation for scale discriminator.
    scale_downsample_pooling_params:
        kernel_size: 4                     # Pooling kernel size.
        stride: 2                          # Pooling stride.
        padding: 2                         # Padding size.
    scale_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [15, 41, 5, 3]       # List of kernel sizes.
        channels: 128                      # Initial number of channels.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        max_groups: 16                     # Maximum number of groups in downsampling conv layers.
        bias: true
        downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
        nonlinear_activation: "LeakyReLU"  # Nonlinear activation.
        nonlinear_activation_params:
            negative_slope: 0.1
    follow_official_norm: true             # Whether to follow the official norm setting.
    periods: [2, 3, 5, 7, 11]              # List of period for multi-period discriminator.
    period_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [5, 3]               # List of kernel sizes.
        channels: 32                       # Initial number of channels.
        downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        bias: true                         # Whether to use bias parameter in conv layer.
        nonlinear_activation: "LeakyReLU"  # Nonlinear activation.
        nonlinear_activation_params:       # Nonlinear activation parameters.
            negative_slope: 0.1
        use_weight_norm: true              # Whether to apply weight normalization.
        use_spectral_norm: false           # Whether to apply spectral normalization.

###########################################################
#                   STFT LOSS SETTING                     #
###########################################################
use_stft_loss: false                 # Whether to use multi-resolution STFT loss.
use_mel_loss: true                   # Whether to use Mel-spectrogram loss.
mel_loss_params:
    fs: 24000
    fft_size: 2048
    hop_size: 300
    win_length: 1200
    window: "hann"
    num_mels: 80
    fmin: 0
    fmax: 12000
    log_base: null
generator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
use_feat_match_loss: true
feat_match_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
    average_by_layers: false         # Whether to average loss by #layers in each discriminator.
    include_final_outputs: false     # Whether to include final outputs in feat match loss calculation.

###########################################################
#               ADVERSARIAL LOSS SETTING                  #
###########################################################
lambda_aux: 45.0       # Loss balancing coefficient for auxiliary (mel-spectrogram) loss.
lambda_adv: 1.0        # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.

###########################################################
#                  DATA LOADER SETTING                    #
###########################################################
batch_size: 16              # Batch size.
batch_max_steps: 8400       # Length of each audio in batch. Make sure it is divisible by hop_size.
pin_memory: true            # Whether to pin memory in PyTorch DataLoader.
num_workers: 2              # Number of workers in PyTorch DataLoader.
remove_short_samples: false # Whether to remove samples the length of which are less than batch_max_steps.
allow_cache: true           # Whether to allow cache in dataset. If true, it requires cpu memory.

###########################################################
#             OPTIMIZER & SCHEDULER SETTING               #
###########################################################
generator_optimizer_type: Adam
generator_optimizer_params:
    lr: 2.0e-4
    betas: [0.5, 0.9]
    weight_decay: 0.0
generator_scheduler_type: MultiStepLR
generator_scheduler_params:
    gamma: 0.5
    milestones:
        - 200000
        - 400000
        - 600000
        - 800000
generator_grad_norm: -1
discriminator_optimizer_type: Adam
discriminator_optimizer_params:
    lr: 2.0e-4
    betas: [0.5, 0.9]
    weight_decay: 0.0
discriminator_scheduler_type: MultiStepLR
discriminator_scheduler_params:
    gamma: 0.5
    milestones:
        - 200000
        - 400000
        - 600000
        - 800000
discriminator_grad_norm: -1

###########################################################
#                    INTERVAL SETTING                     #
###########################################################
generator_train_start_steps: 1     # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 1500000           # Number of training steps.
save_interval_steps: 25000         # Interval steps to save checkpoint.
eval_interval_steps: 5000          # Interval steps to evaluate the network.
log_interval_steps: 100            # Interval steps to record the training log.

###########################################################
#                     OTHER SETTING                       #
###########################################################
num_save_intermediate_results: 4  # Number of results to be saved as intermediate results.
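
The thread does not resolve this error. One observation is that the two sizes in the message, 28 and 80, coincide with quantities that can be computed from this config. The sketch below only does that arithmetic. The interpretation in the comments (mel frames per chunk versus frames produced by the noise upsampling) is an assumption, not something confirmed in the thread.

```python
import math

hop_size = 300
batch_max_steps = 8400
noise_upsample_scales = [10, 2, 2, 2]
upsample_scales = [5, 1, 5, 1, 3, 1, 2, 2, 1]

# Assumed meaning: mel frames in one training chunk.
aux_frames = batch_max_steps // hop_size
# Assumed meaning: frames produced when the noise input is upsampled.
noise_frames = math.prod(noise_upsample_scales)
# Waveform samples generated per frame by the main upsampling stack.
samples_per_frame = math.prod(upsample_scales)

print(aux_frames, noise_frames, samples_per_frame)

# A batch_max_steps value for which the two frame counts would agree.
print(noise_frames * hop_size)
```

If that reading is right, either `batch_max_steps` or `noise_upsample_scales` would need to change so that the two frame counts match. This is a hypothesis to test, not a confirmed fix.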