talka1 opened this issue 3 years ago
Hi @talka1,
For the parameters in the configuration, please check the original paper (https://arxiv.org/pdf/2103.02858) for more details. But before digging into the configs, I think the size of the training dataset and the vocoder are more fundamental causes of the current poor performance.
First, a training set containing only 2 speakers * 90 utterances = 180 utterances is too small. Looking at your MCD values, 11 is so high that I don't think the voices are really being converted. Increasing both the number of training speakers and the number of utterances per speaker would help. For example, you can combine your training data with vcc2018 or vcc2020.
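(For reference, a minimal sketch of how mel-cepstral distortion is usually computed between two time-aligned mcep sequences; the function name is made up, and excluding the 0th coefficient is the common convention, not necessarily exactly what this repo's evaluation script does.)

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_cv):
    """MCD in dB between two time-aligned mcep sequences of shape (T, D).

    Per frame: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), then averaged
    over frames. The 0th (energy) coefficient is excluded by convention.
    """
    diff = mcep_ref[:, 1:] - mcep_cv[:, 1:]  # drop the 0th coefficient
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

# Toy call with random (T=100, D=35) sequences, just to show the shapes.
ref, cv = np.random.randn(100, 35), np.random.randn(100, 35)
print(f"MCD: {mel_cepstral_distortion(ref, cv):.3f} dB")
```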
As for the vocoder, it is crucial to train a vocoder on your custom data to get perceptually good results. In this repo we use the Parallel WaveGAN (PWG), and you can check out this repo on how to train your own PWG: https://github.com/kan-bayashi/ParallelWaveGAN. Again, 180 utterances are too few to train a neural vocoder, so you may consider combining your data with other datasets.
@unilight
Hi @turian,
> In the paper you use 75 * 12 = 900 utterances; is that generally enough?
It depends on how you define "enough". More data leads to better performance, but I rarely see studies on the relationship between training data size and the performance in VAE-based VC.
> Must you train the vocoder on your corpus, or can you just use a pretrained one out-of-the-box? Your comment suggests you need to train it. Can't you finetune an existing pretrained vocoder?
Again, there's really no "can" or "cannot". Everything depends on the performance. All I can say is that speaker-dependent (SD) vocoders are better than speaker-independent (SI) ones, and testing on open speakers further degrades the performance. (See the comparison in https://github.com/kan-bayashi/PytorchWaveNetVocoder#results)
> If for one speaker you have a very long utterance, will crank be able to handle that one WAV? Or should you chop it into smaller files?
No, since it is a frame-wise model it should not suffer from the instability problems seen in seq2seq models. Also, the long receptive field should be able to prevent discontinuities.
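(As a rough illustration of how far such a receptive field can reach, here is a small calculation for a stack of dilated 1-D convolutions; the kernel sizes and layer counts are borrowed from the default config posted below, but the exponential dilation schedule is an assumption for illustration, not the exact crank architecture.)

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (in frames) of stacked 1-D convolutions:
    RF = 1 + sum_i (kernel_size_i - 1) * dilation_i
    """
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))

# Hypothetical example: one stack of 4 layers with kernel size 5 and one of
# 3 layers with kernel size 3, each with exponentially growing dilations
# (the dilation schedule is assumed, purely for illustration).
kernels = [5] * 4 + [3] * 3
dilations = [2 ** i for i in range(4)] + [2 ** i for i in range(3)]
frames = receptive_field(kernels, dilations)
print(f"{frames} frames ≈ {frames * 128 / 22050:.2f} s at hop_size=128, fs=22050")
```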
Hey, I wanted to reply back. This time I used 7 speakers with 282 data files each (a new dataset): 7 speakers * 282 utterances = 1974 utterances.
These were the results after Stage 7 (200,000 steps). Mel-cepstrum distortion for speaker 7 only (the other 6 speakers had similar scores):
```
spkr7 spkr1 12.168
spkr7 spkr2 11.498
spkr7 spkr3 12.102
spkr7 spkr4 10.805
spkr7 spkr5 11.739
spkr7 spkr6 11.545
spkr7 spkr7  6.860
```
I am training my own PWG right now, but it will probably take a month or two... Meanwhile I used a few other pretrained PWGs, and all I can say is that increasing the dataset brought no improvement at all. In fact there was no difference: the evaluated WAVs still sounded the same as with my other model trained on only 180 utterances, with a few artifacts and missing frequencies. I am missing something here... The generated evals were also twice as slow as the original audio files.
Going through the devs_evals, I noticed that both models generated outputs that sounded very similar. I am pretty sure it has to do with mlfb_vqvae.yml; these were the settings (left at default):
```yaml
# feature setting
feature:
  label: mlfb                     # feature label
  fs: 22050                       # sampling frequency
  fftl: 1024                      # FFT size
  win_length: 1024                # window length, must be fftl >= win_length
  hop_size: 128                   # hop_size, shift_size
  window_types: [hann]            # type of FFT window ['hann', 'itu-g']
  fmin: 80                        # start freq to calculate mlfb
  fmax: 7600                      # end freq to calculate mlfb
  mlfb_dim: 80                    # dimension of mlfb
  n_iteration: 100                # # iterations of Griffin-Lim
  framems: 20                     # frame size [ms] for WORLD
  shiftms: 5.80499                # shift size [ms] for WORLD
  mcep_dim: 34                    # dimension of mcep
  mcep_alpha: 0.466               # all-path filter coefficient for mcep

# general setting
trainer_type: vqvae               # trainer type ['vqvae', 'stargan']
input_feat_type: mlfb             # input feature type ['mlfb', 'mcep']
output_feat_type: mlfb            # output feature type ['mlfb', 'mcep']
use_raw: false                    # use raw waveform as network inputs
use_preprocessed_scaler: false    # use scaler (require use_raw)
use_sinc_conv: false              # use SincConv (require use_raw)
raw_window_type: hann             # window for raw input
input_size: 80                    # input_size for G ['mlfb_dim', 'mcep_dim+1']
output_size: 80                   # output_size for G ['mlfb_dim', 'mcep_dim+1']
n_steps: 200000                   # # training steps
dev_steps: 2000                   # # development steps
n_steps_save_model: 5000          # save model every # step
n_steps_print_loss: 50            # print loss every # step
batch_size: 50                    # # mini-batches
batch_len: 500                    # frame length of a mini-batch
cache_dataset: true               # cache_dataset (NOTE: fix cv_spkr_label in a mini-batch)
spec_augment: false               # apply SpecAugment mask to input feature
n_spec_augment: 0                 # # of bands of spec_augment (both time and freq axises)
use_mcep_0th: false               # model 0th order of mcep
ignore_scaler: ["raw", "mcep"]    # ignore applying scaler ['mlfb', 'mcep', 'lcf0']
sinc_conv_kernel_sizes: 65        # size of sinc conv filter
sinc_conv_channels: 32            # size of sinc conv channels
sinc_conv_down_sample_kernel_sizes: [4, 4, 4, 2]  # size of kernels for down sampling layers (product must match hop_size)

# alpha
alpha:
  l1: 2
  mse: 0
  stft: 1
  commit: 0.25
  dict: 0.5
  cycle: 0.1
  ce: 1
  adv: 1
  real: 0.5
  fake: 0.5
  acgan: 1

stft_params:                      # parameters for STFT loss
  fft_sizes: [64, 128]
  win_sizes: [64, 128]
  hop_sizes: [16, 32]
  logratio: 0                     # ratio between log-magnitude and magnitude

optim:
  G:
    type: adam
    lr: 0.0002
    decay_size: 0.5
    decay_step_size: 200000
    clip_grad_norm: 0.0
  D:
    type: adam
    lr: 0.00005
    decay_size: 0.5
    decay_step_size: 200000
    clip_grad_norm: 0.0
  C:
    type: adam
    lr: 0.0001
    decay_size: 0.5
    decay_step_size: 200000
    clip_grad_norm: 0.0
  SPKRADV:
    type: adam
    lr: 0.0001
    decay_size: 0.5
    decay_step_size: 200000
    clip_grad_norm: 0.0

# for G
encoder_f0: false                 # conditioned on F0 (lcf0, uv) to encoder
decoder_f0: true                  # conditioned on F0 (lcf0, uv) to decoder
encoder_energy: false             # conditioned on energy (cenergy, energy_uv) to encoder
decoder_energy: false             # conditioned on energy (cenergy, energy_uv) to decoder
causal: false                     # use causal network
causal_size: 0                    # # of look up frames (allow future frames)
use_spkr_embedding: true          # use nn.Embedding instead of 1-hot vector
spkr_embedding_size: 32           # output_size of speaker embedding
ema_flag: true                    # use exponential moving average
n_vq_stacks: 2                    # # of hierarchical VQVAE stacks [1, 2, 3]
n_layers_stacks: [4, 3, 2]        # # stacks of layers in the stack
n_layers: [2, 2, 2]               # # layers
kernel_size: [5, 3, 3]            # # kernels in each stack
emb_dim: [64, 64, 64]             # embedding dimension size for VQ
emb_size: [512, 512, 512]         # # codebooks
use_spkradv_training: true        # use speaker adversarial network
n_spkradv_layers: 3               # # of layers
spkradv_kernel_size: 3            # kernel_size for speaker adversarial net
spkradv_lambda: 0.1               # hyper-parameter of gradient reversal layer
use_spkr_classifier: true         # train speaker classifier
n_spkr_classifier_layers: 8       # # of classifier layers
spkr_classifier_kernel_size: 5    # kernel_size for classifier
use_cyclic_training: false        # train VQVAE with cyclic constraints
n_steps_cycle_start: 50000        # # steps when starts cycle-consistency training
n_cycles: 1                       # # cycles of cyclic architecture

# for D
n_steps_gan_start: 100000         # # of steps to start adv training
gan_type: lsgan                   # type of GAN ['lsgan']
use_residual_network: true        # use ResidualParallelWaveGANDiscriminator
n_discriminator_layers: 2         # # of discriminator layers
n_discriminator_stacks: 4         # # of stacks for Discriminator
discriminator_kernel_size: 5      # kernel_size for discriminator
discriminator_dropout: 0.25       # dropout rate if use_residual_network
train_first: D                    # updated first ['G', 'D']
switch_update: False              # Update only using either real or fake at 1 iteration
cvadv_flag: false                 # calculate adv_loss using converted or reconstructed
acgan_flag: false                 # add additional classifier to discriminator
encoder_detach: false             # detach encoder and VQ for adv_loss (update decoder only)
use_real_only_acgan: false        # train acgan with real sample only
use_D_uv: true                    # conditioned on uv symbol to D
use_D_spkrcode: true              # conditioned on speaker code to D
use_vqvae_loss: true              # update G as well as vqvae loss
n_steps_stop_generator: 0         # # of steps for stopping training discriminator
```
So any suggestions?
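(One thing that may be worth double-checking when pairing a crank model with a pretrained PWG checkpoint: the feature settings above (fs, hop_size, fftl, mlfb_dim, fmin, fmax) should match the vocoder's own config. A minimal sketch with PyYAML; the paths are placeholders, and the key names on the PWG side are assumptions based on typical ParallelWaveGAN configs.)

```python
import yaml

# Placeholder paths -- point these at your actual config files.
with open("conf/mlfb_vqvae.yml") as f:
    vc_feat = yaml.safe_load(f)["feature"]
with open("downloads/pwg/config.yml") as f:
    pwg_conf = yaml.safe_load(f)

# (crank key, assumed PWG key) pairs that should agree for decoding to work.
pairs = [
    ("fs", "sampling_rate"),
    ("hop_size", "hop_size"),
    ("fftl", "fft_size"),
    ("mlfb_dim", "num_mels"),
    ("fmin", "fmin"),
    ("fmax", "fmax"),
]
for vc_key, pwg_key in pairs:
    vc_val, pwg_val = vc_feat.get(vc_key), pwg_conf.get(pwg_key)
    status = "OK" if vc_val == pwg_val else "MISMATCH"
    print(f"{status}: {vc_key}={vc_val} vs {pwg_key}={pwg_val}")
```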
Are there any detailed explanations of all the parameters in the config files and how they affect the audio?
I left everything at default and trained 2 speakers with a dataset of 90 files each.
These were the results after Stage 7 (mel-cepstrum distortion):
MOSNet score prediction (0 = bad, 5 = good):
After around 30,000 steps there was no significant improvement up to 200,000 steps in the training stage. I used a pretrained vocoder from the vcc_2020 example, since other pretrained PWG vocoders resulted in female voices, but both my speakers were male. The generated WAVs had an awful distorted growl in the lower frequencies, and the pitch mostly sounded very flat. So now I am thinking about how to improve the model. Do I need more data for the speakers? What parameters do I need to change in the config files before starting Stage 2 again?