StyleMelGAN model structure change for 32 kHz vocoder

wblgers commented 2 years ago

Dear professor,

I'd like to train a 32 kHz StyleMelGAN vocoder. Below is my config for training a multi-speaker model. At the moment, speaker similarity on a VC task is worse than with Fre-GAN/HiFi-GAN. Could you give me some advice on improving the quality? There are two changes I want to try: (1) increasing in_channels from 128 to 256; (2) increasing the conv kernel size. Thanks!

###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
sampling_rate: 32000     # Sampling rate.
fft_size: 2048           # FFT size.
hop_size: 320            # Hop size.
win_length: 1600         # Window length.
                         # If set to null, it will be the same as fft_size.
window: "hann"           # Window function.
num_mels: 80             # Number of mel basis.
fmin: 80                 # Minimum freq in mel basis calculation.
fmax: 7600               # Maximum frequency in mel basis calculation.
global_gain_scale: 1.0   # Will be multiplied to all of waveform.
trim_silence: false      # Whether to trim the start and end of silence.
trim_threshold_in_db: 60 # Need to tune carefully if the recording is not good.
trim_frame_size: 2048    # Frame size in trimming.
trim_hop_size: 512       # Hop size in trimming.
format: "npy"            # Feature file format. "npy" or "hdf5" is supported.

min_level_db: -100
ref_level_db: 20.0

preemphasis: 0.97
###########################################################
#         GENERATOR NETWORK ARCHITECTURE SETTING          #
###########################################################
generator_type: "StyleMelGANGenerator" # Generator type.
generator_params:
    in_channels: 128
    aux_channels: 80
    channels: 64
    out_channels: 1
    kernel_size: 9
    dilation: 2
    bias: True
    noise_upsample_scales: [5, 5, 2, 2]
    noise_upsample_activation: "LeakyReLU"
    noise_upsample_activation_params:
        negative_slope: 0.2
    upsample_scales: [5, 1, 4, 1, 4, 1, 2, 2, 1]
    upsample_mode: "nearest"
    gated_function: "softmax"
    use_weight_norm: True

###########################################################
#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
###########################################################
discriminator_type: "StyleMelGANDiscriminator" # Discriminator type.
discriminator_params:
    repeats: 4
    window_sizes: [512, 1024, 2048, 4096]
    pqmf_params:
        - [1, None, None, None]
        - [2, 62, 0.26700, 9.0]
        - [4, 62, 0.14200, 9.0]
        - [8, 62, 0.07949, 9.0]
    discriminator_params:
        out_channels: 1
        kernel_sizes: [5, 3]
        channels: 16
        max_downsample_channels: 512
        bias: True
        downsample_scales: [4, 4, 4, 1]
        nonlinear_activation: "LeakyReLU"
        nonlinear_activation_params:
            negative_slope: 0.2
    use_weight_norm: True

###########################################################
#                   STFT LOSS SETTING                     #
###########################################################
stft_loss_params:
    fft_sizes: [1024, 2048, 512]  # List of FFT size for STFT-based loss.
    hop_sizes: [120, 240, 50]     # List of hop size for STFT-based loss
    win_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
    window: "hann_window"         # Window function for STFT-based loss
lambda_aux: 1.0                   # Loss balancing coefficient for aux loss.

###########################################################
#               ADVERSARIAL LOSS SETTING                  #
###########################################################
lambda_adv: 1.0 # Loss balancing coefficient for adv loss.
generator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.

###########################################################
#                  DATA LOADER SETTING                    #
###########################################################
batch_size: 32              # Batch size.
batch_max_steps: 32000      # Length of each audio in batch. Make sure it is divisible by hop_size.
pin_memory: true            # Whether to pin memory in Pytorch DataLoader.
num_workers: 4              # Number of workers in Pytorch DataLoader.
remove_short_samples: false # Whether to remove samples the length of which are less than batch_max_steps.
allow_cache: false          # Whether to allow cache in dataset. If true, it requires CPU memory.

###########################################################
#             OPTIMIZER & SCHEDULER SETTING               #
###########################################################
generator_optimizer_type: Adam
generator_optimizer_params:
    lr: 4.0e-4
    betas: [0.5, 0.9]
    weight_decay: 0.0
generator_scheduler_type: MultiStepLR
generator_scheduler_params:
    gamma: 0.5
    milestones:
        - 100000
        - 300000
        - 500000
        - 700000
        - 900000
        - 110000
generator_grad_norm: -1
discriminator_optimizer_type: Adam
discriminator_optimizer_params:
    lr: 2.0e-4
    betas: [0.5, 0.9]
    weight_decay: 0.0
discriminator_scheduler_type: MultiStepLR
discriminator_scheduler_params:
    gamma: 0.5
    milestones:
        - 200000
        - 400000
        - 600000
        - 800000
discriminator_grad_norm: -1

###########################################################
#                    INTERVAL SETTING                     #
###########################################################
discriminator_train_start_steps: 100000 # Number of steps to start to train discriminator.
train_max_steps: 1500000                # Number of training steps.
save_interval_steps: 50000              # Interval steps to save checkpoint.
eval_interval_steps: 1000               # Interval steps to evaluate the network.
log_interval_steps: 100                 # Interval steps to record the training log.

###########################################################
#                     OTHER SETTING                       #
###########################################################
num_save_intermediate_results: 4  # Number of results to be saved as intermediate results.
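
A few relationships between the settings above can be checked mechanically before spending GPU time. The sketch below is not part of ParallelWaveGAN: `review_config` and the `cfg` dictionary are hypothetical names, and the values are copied by hand from the posted config. The rule that the product of `noise_upsample_scales` should equal the number of frames in a training segment is an assumption about how the StyleMelGAN generator consumes its noise input, so treat that check as a hint rather than a guarantee.

```python
from math import prod

# Values copied from the config posted above.
cfg = {
    "sampling_rate": 32000,
    "hop_size": 320,
    "fmax": 7600,
    "batch_max_steps": 32000,
    "upsample_scales": [5, 1, 4, 1, 4, 1, 2, 2, 1],
    "noise_upsample_scales": [5, 5, 2, 2],
    "generator_milestones": [100000, 300000, 500000, 700000, 900000, 110000],
}


def review_config(cfg):
    """Return {setting: note} for settings that may deserve a second look."""
    notes = {}
    hop = cfg["hop_size"]

    # The generator must upsample one mel frame to exactly hop_size samples.
    total = prod(cfg["upsample_scales"])
    if total != hop:
        notes["upsample_scales"] = f"product {total} != hop_size {hop}"

    # Training segments should contain a whole number of frames.
    if cfg["batch_max_steps"] % hop != 0:
        notes["batch_max_steps"] = "not divisible by hop_size"
    else:
        # ASSUMPTION: the noise branch is expected to produce one vector
        # per frame of a training segment.
        frames = cfg["batch_max_steps"] // hop
        noise = prod(cfg["noise_upsample_scales"])
        if noise != frames:
            notes["noise_upsample_scales"] = f"product {noise} != {frames} frames"

    # Mel features carry no information above fmax.
    nyquist = cfg["sampling_rate"] // 2
    if cfg["fmax"] < nyquist:
        notes["fmax"] = f"mel covers up to {cfg['fmax']} Hz of {nyquist} Hz"

    # MultiStepLR milestones are documented as an increasing list.
    ms = cfg["generator_milestones"]
    if any(b <= a for a, b in zip(ms, ms[1:])):
        notes["generator_milestones"] = "not strictly increasing"

    return notes


for key, note in review_config(cfg).items():
    print(f"{key}: {note}")
```

With the values as posted, the upsampling checks pass (both products match: 320 samples per frame and 100 frames per segment), while two settings are flagged: `fmax` (7600 Hz against a 16000 Hz Nyquist limit) and the last generator milestone (110000 following 900000, which may have been meant as 1100000). Whether the `fmax` limit contributes to the similarity gap is not established in this thread.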

MikeAleksa commented 2 years ago

@wblgers have you seen any improvement in StyleMelGAN at higher sampling rates by changing input channels or kernel size?

wblgers commented 2 years ago

> conv kernel size

I'm working on it. Multi-GPU training is not speeding things up as expected, so training is slow. I'll share some results when it finishes.

MikeAleksa commented 2 years ago

@wblgers Great. I'm testing a larger in_channels and will report back when I have results as well.

As for multi-GPU training, I have found that increasing the LR and shortening the intervals between scheduled LR changes helps speed up training.

My understanding is that running the same config on multiple GPUs only increases the effective batch size. With a larger batch, each optimizer step should be based on a better gradient estimate, so you can raise the LR in proportion to the batch size.

I tried multiplying the LR by 4 for 8 GPUs and halving the number of steps between scheduler changes. I'm not sure what the best settings are, though. You may be able to scale 1:1 for the fastest training (e.g. 8x LR for 8 GPUs, with the step counts divided by 8), but I'm not sure.
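
The adjustment described above can be written down as a small helper. This is only a sketch of the arithmetic: `scale_schedule` is a hypothetical function, not a ParallelWaveGAN API, and the base values are the discriminator LR and milestones from the config earlier in the thread. The factors 4 and 2 are the ones tried above for 8 GPUs; nothing here says they are optimal.

```python
def scale_schedule(lr, milestones, lr_factor, step_divisor):
    """Scale a base learning rate up and compress MultiStepLR milestones.

    lr_factor    -- multiplier applied to the learning rate
    step_divisor -- integer divisor applied to every milestone
    """
    if lr_factor <= 0 or step_divisor <= 0:
        raise ValueError("lr_factor and step_divisor must be positive")
    new_lr = lr * lr_factor
    # Integer division keeps milestones as whole step counts.
    new_milestones = [max(1, m // step_divisor) for m in milestones]
    return new_lr, new_milestones


# Discriminator settings from the posted config, adjusted as tried for 8 GPUs.
base_lr = 2.0e-4
base_milestones = [200000, 400000, 600000, 800000]

new_lr, new_milestones = scale_schedule(
    base_lr, base_milestones, lr_factor=4, step_divisor=2
)
print(new_lr, new_milestones)
```

The resulting values would go into `discriminator_optimizer_params.lr` and `discriminator_scheduler_params.milestones`, with the same treatment applied to the generator. Whether step-based settings such as `discriminator_train_start_steps` and `train_max_steps` should be divided as well is not covered in this thread.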

kan-bayashi / ParallelWaveGAN

StyleMelGAN model structure change for 32 kHz vocoder #370