r9y9 / wavenet_vocoder

WaveNet vocoder
https://r9y9.github.io/wavenet_vocoder/

Help extending to MAILabs data - Warbly speech - MoL, 1000k steps #183

Open · adhamel opened 4 years ago

adhamel commented 4 years ago

Dear @r9y9, I've trained a MoL WaveNet to 1000k steps on ~30,000 audio samples from the M-AILABS dataset. I am using a pre-trained Transformer from @kan-bayashi.

The resulting speech is fairly intelligible but has a warble to it that I would like to clear up. Happy to share generated samples or configurations to help diagnose. Do you have any experience training on that dataset, or recommendations on what might move me in the right direction?

Best, Andy

r9y9 commented 4 years ago

Hi, sorry for the late reply. If I remember correctly, the M-AILABS samples have a low signal-to-noise ratio, so WaveNet might struggle to learn the distribution of clean speech. To diagnose the cause, could you share some generated audio samples and your training configuration?

adhamel commented 4 years ago

Hey, no worries. I trained with the mixture-of-logistics configuration on data from a single male Spanish speaker. Following your recommendations elsewhere, I decreased the allowed log_scale_min as training progressed.

Here is a sample after ~1.6M steps: https://github.com/adhamel/samples/blob/master/response.wav

For evaluation, I'm using generated .npy features from this Transformer (https://github.com/espnet/espnet/blob/master/egs/m_ailabs/tts1/RESULTS.md):

- v.0.5.3 / Transformer
- Silence trimming
- FFT in points: 1024
- Shift in points: 256
- Frequency limit: 80-7600 Hz
- Fast-GL 64 iters
- Environments
  - date: Sun Sep 29 21:20:05 JST 2019
  - python version: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
  - espnet version: espnet 0.5.1
  - chainer version: chainer 6.0.0
  - pytorch version: pytorch 1.0.1.post2
  - Git hash: 6b2ff45d1e2c624691f197014b8fe71a5e70bae9
  - Commit date: Sat Sep 28 14:33:32 2019 +0900
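For reference, here is a rough librosa-based approximation of those analysis conditions (1024-point FFT, 256-point shift, 80 mels, 80-7600 Hz). espnet's actual pipeline additionally trims silence and normalizes, so this is only a sketch:

```python
import librosa
import numpy as np

# Analysis conditions from the espnet recipe above (16 kHz assumed).
SR, N_FFT, HOP, N_MELS, FMIN, FMAX = 16000, 1024, 256, 80, 80, 7600

def logmel(wav_path):
    """Log-mel spectrogram approximating the recipe's analysis settings."""
    y, _ = librosa.load(wav_path, sr=SR)
    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=N_FFT, hop_length=HOP,
        n_mels=N_MELS, fmin=FMIN, fmax=FMAX)
    return np.log(np.maximum(mel, 1e-10)).T  # (frames, 80)
```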

r9y9 commented 4 years ago

Could you also share the config file(s) for WaveNet?

For the generated sample, the signal gain seems too high. I suspect there is a mismatch between the acoustic features at training time and those at evaluation time. Did you carefully normalize the acoustic features? Did you make sure you used the same acoustic feature pipeline for training both the Transformer and the WaveNet?
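One quick way to check for such a mismatch is to compare per-bin statistics of the .npy features from both pipelines. A minimal sketch (the glob patterns below are placeholders):

```python
import glob
import numpy as np

def feature_stats(pattern):
    """Per-mel-bin mean/std over a set of (frames, num_mels) .npy files."""
    feats = np.concatenate([np.load(f) for f in glob.glob(pattern)], axis=0)
    return feats.mean(axis=0), feats.std(axis=0)

# Placeholder paths: features WaveNet was trained on vs. Transformer outputs.
train_mu, train_sd = feature_stats("wavenet_train_feats/*.npy")
eval_mu, eval_sd = feature_stats("transformer_out_feats/*.npy")

# Large offsets here usually mean the two pipelines normalize differently.
print("max mean offset:", np.abs(train_mu - eval_mu).max())
print("max std ratio:  ", (eval_sd / np.maximum(train_sd, 1e-8)).max())
```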

adhamel commented 4 years ago

Absolutely. Here are the overridden hparams. I also tried an fmin value of 125. I did not take care to normalize the acoustic features; however, the WaveNet is trained on the same data subset as the Transformer.

{ "name": "wavenet_vocoder", "input_type": "raw", "quantize_channels": 65536, "preprocess": "preemphasis", "postprocess": "inv_preemphasis", "global_gain_scale": 0.55, "sample_rate": 16000, "silence_threshold": 2, "num_mels": 80, "fmin": 80, "fmax": 7600, "fft_size": 1024, "hop_size": 256, "frame_shift_ms": null, "win_length": 1024, "win_length_ms": -1.0, "window": "hann", "highpass_cutoff": 70.0, "output_distribution": "Logistic", "log_scale_min": -32.23619130191664, "out_channels": 30, "layers": 24, "stacks": 4, "residual_channels": 128, "gate_channels": 256, "skip_out_channels": 128, "dropout": 0.0, "kernel_size": 3, "cin_channels": 80, "cin_pad": 2, "upsample_conditional_features": true, "upsample_net": "ConvInUpsampleNetwork", "upsample_params": { "upsample_scales": [ 4, 4, 4, 4 ] }, "gin_channels": -1, "n_speakers": 7, "pin_memory": true, "num_workers": 2, "batch_size": 8, "optimizer": "Adam", "optimizer_params": { "lr": 0.001, "eps": 1e-08, "weight_decay": 0.0 }, "lr_schedule": "step_learning_rate_decay", "lr_schedule_kwargs": { "anneal_rate": 0.5, "anneal_interval": 200000 }, "max_train_steps": 1000000, "nepochs": 2000, "clip_thresh": -1, "max_time_sec": null, "max_time_steps": 10240, "exponential_moving_average": true, "ema_decay": 0.9999, "checkpoint_interval": 100000, "train_eval_interval": 100000, "test_eval_epoch_interval": 50, "save_optimizer_state": true }

r9y9 commented 4 years ago

The hparams look okay. I'd recommend double-checking for acoustic feature normalization differences (if any), and also checking analysis/synthesis quality (i.e., vocoding features extracted from ground-truth audio, not TTS output).
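For the analysis/synthesis check, one cheap test that skips WaveNet entirely is to invert the eval-time features with Griffin-Lim; if even that sounds wrong (wrong gain, wrong pitch), the feature pipeline is the problem. A sketch using librosa, assuming plain log-mel features (adjust the de-normalization to however the features were actually produced; the input path is a placeholder):

```python
import librosa
import numpy as np
import soundfile as sf

SR, N_FFT, HOP, FMIN, FMAX = 16000, 1024, 256, 80, 7600

# Placeholder path; these two lines assume un-normalized log-mel features.
mel = np.load("sample_feats.npy").T   # -> (num_mels, frames)
mel = np.exp(mel)                     # undo the log

wav = librosa.feature.inverse.mel_to_audio(
    mel, sr=SR, n_fft=N_FFT, hop_length=HOP,
    fmin=FMIN, fmax=FMAX, n_iter=64)  # 64 GL iterations, as in the recipe
sf.write("gl_check.wav", wav, SR)
```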

Pre-emphasis at the data preprocessing stage changes the signal gain, so you might want to tune global_gain_scale. 0.55 was chosen for LJSpeech, if I remember correctly.
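Concretely, pre-emphasis is the standard first-order filter pair sketched below (0.97 is a common coefficient; the preprocessing in use may differ). Because speech energy is concentrated at low frequencies, it shrinks the waveform's amplitude, which is why global_gain_scale is corpus-dependent:

```python
import numpy as np
from scipy.signal import lfilter

def preemphasis(x, coef=0.97):
    """y[n] = x[n] - coef * x[n-1]: attenuates low frequencies."""
    return lfilter([1.0, -coef], [1.0], x)

def inv_preemphasis(x, coef=0.97):
    """Inverse filter; undoes preemphasis exactly."""
    return lfilter([1.0], [1.0, -coef], x)

# Gain effect on a low-frequency tone (a rough stand-in for voiced speech):
t = np.arange(16000) / 16000.0
x = 0.8 * np.sin(2 * np.pi * 100 * t)
print(np.abs(preemphasis(x)).max())  # ~0.04, far below the original 0.8
```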

Another suggestion is to use a higher log_scale_min (e.g., -9 or -11). As the ClariNet paper suggests, a smaller variance bound requires more training iterations and can make training unstable.
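Roughly, log_scale_min acts as a floor on each logistic component's scale inside the loss. A simplified (continuous, non-discretized) sketch of the idea, not the repo's exact discretized_mix_logistic_loss:

```python
import torch
import torch.nn.functional as F

def logistic_nll(y, means, log_scales, log_scale_min=-9.0):
    """Simplified logistic negative log-likelihood with a floored scale.

    The clamp keeps components from collapsing to near-zero variance;
    with a very low floor (e.g. -32), training needs more iterations
    and can destabilize, per the ClariNet paper.
    """
    log_scales = torch.clamp(log_scales, min=log_scale_min)
    z = (y - means) * torch.exp(-log_scales)
    # negative log-pdf of the logistic distribution
    return z + log_scales + 2.0 * F.softplus(-z)
```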

adhamel commented 4 years ago

Thank you, you are correct. I will test a higher log_scale_min. (As a strange aside, I found significant drops in loss at intervals of ~53 epochs.) I hope y'all are staying safe over there.