r9y9 / deepvoice3_pytorch

PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models
https://r9y9.github.io/deepvoice3_pytorch/

Speech clarity tails off at end of sentences #99

Closed nmstoker closed 6 years ago

nmstoker commented 6 years ago

I have been training initially on LJSpeech and then fine-tuning with about three hours of my own voice data. Although the results clearly have recognisable intonation and are quite good at the start of a sentence, the quality rapidly tails off towards the end of sentences.

It starts off fairly clear but then becomes incoherent (as per the .wav file in the attached zip).

The attached audio is meant to be saying this sample sentence:

"Why don't main characters of the walking dead shield their legs from crawling walkers?"

It only really gets as far as "...walking dead" and then it becomes hard to hear.

18_checkpoint_step000500000.zip

I went over my training data and initially found (and removed) a few cases where the recording had stopped too early (so the audio was shorter than the accompanying sentence). I had expected those would impair the model's ability to learn the later parts of sentences, as it would be getting contradictory signals (ie text that wasn't in the audio). However, removing them made minimal difference.
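
A quick way to catch any remaining truncated clips is to compare transcript length against audio duration. A minimal sketch, assuming an LJSpeech-style metadata.csv ("id|text|normalized text"), a wavs/ directory, and an arbitrary speaking-rate threshold:

# Flag clips whose audio looks too short for their transcript.
import csv
import soundfile as sf

MAX_CHARS_PER_SEC = 25.0  # rough upper bound on speaking rate; tune for your voice

with open("metadata.csv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="|"):
        clip_id, text = row[0], row[-1]
        audio, sr = sf.read(f"wavs/{clip_id}.wav")
        duration = len(audio) / sr
        if len(text) / duration > MAX_CHARS_PER_SEC:
            print(f"{clip_id}: {duration:.2f}s looks too short for {len(text)} chars")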

My training data does include a fair number of somewhat longer audio samples (ie perhaps 10 seconds in length), although the majority are typically 3-6 seconds.
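
A few lines to sanity-check that distribution, under the same assumed wavs/ layout (soundfile reads durations from the file headers without loading audio):

# Summarise the clip-length distribution of the training set.
import glob
import numpy as np
import soundfile as sf

durations = np.array([sf.info(p).duration for p in glob.glob("wavs/*.wav")])
print(f"n={len(durations)}  mean={durations.mean():.1f}s  "
      f"median={np.percentile(durations, 50):.1f}s  "
      f"p95={np.percentile(durations, 95):.1f}s  max={durations.max():.1f}s")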

Do you have advice about how best to help it learn the ends of sentences? If I have to record a lot more long sentences, I can do so, but I was hoping there might be some parameters I could experiment with in the first instance.

Many thanks! Neil

nmstoker commented 6 years ago

Any advice on where I could start to figure this out?

I've now increased my training dataset to 4.5 hours of audio (single speaker = me!). I used checkpoints from different amounts of LJSpeech training to see if that helped, but it seems to make little impact. I then trained for between 10 and 20 hours on my own data, and it's still tailing off pretty quickly, going from clear to gibberish within a given test phrase.

Factors I wondered about, but which seem unlikely to cause this, include:

I could also start systematically trying to adjust the parameters, but that might take a long time, as it's not clear there's a problem until it has trained for a decent period.

What about increasing window_ahead? Any other suggestions?
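
For reference, window_ahead/window_backward constrain the attention window during synthesis when force_monotonic_attention is on. A minimal sketch of the idea (not the repo's actual implementation):

# Windowed monotonic attention for one decoder step: energies outside
# [last_pos - window_backward, last_pos + window_ahead] are masked out
# before the softmax, so attention can't jump around or stall late in
# a sentence.
import torch

def windowed_attention(scores, last_pos, window_ahead=3, window_backward=1):
    # scores: (num_encoder_steps,) raw attention energies for this step
    mask = torch.full_like(scores, float("-inf"))
    lo = max(0, last_pos - window_backward)
    hi = min(scores.numel(), last_pos + window_ahead + 1)
    mask[lo:hi] = 0.0
    weights = torch.softmax(scores + mask, dim=0)
    return weights, int(weights.argmax())  # feed the position back as next last_pos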

G-Wang commented 6 years ago

Hello, I'm not sure about the particulars of your dataset, but I've been able to successfully adapt a British male voice from a trained LJSpeech model (Nyanko build). I used the Gentle aligner to generate my dataset, so there are no silences before/after sentences in the data.
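
For anyone without a Gentle pipeline, a rough stand-in is energy-based trimming with librosa (the file name and the 30 dB threshold below are placeholders to tune):

# Cut leading/trailing low-energy audio. Forced alignment is more exact
# because it knows where the first and last words actually are.
import librosa
import soundfile as sf

y, sr = librosa.load("wavs/clip.wav", sr=None)  # keep the native sample rate
y_trimmed, _ = librosa.effects.trim(y, top_db=30)
sf.write("wavs/clip_trimmed.wav", y_trimmed, sr)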

What hyperparameters are you using?

nmstoker commented 6 years ago

Thanks @G-Wang - the presets were based on deepvoice3_ljspeech.json (see below). I'm now giving it a go with Nyanko, as you suggested, to see if that improves things.

The only things I'd tried increasing to help with the quality were encoder_channels and converter_channels (they didn't seem to affect it much, to be honest!).

The full preset is:

{
  "name": "deepvoice3",
  "frontend": "en",
  "replace_pronunciation_prob": 0.5,
  "builder": "deepvoice3",
  "n_speakers": 1,
  "speaker_embed_dim": 16,
  "num_mels": 80,
  "fmin": 125,
  "fmax": 7600,
  "fft_size": 1024,
  "hop_size": 256,
  "sample_rate": 22050,
  "preemphasis": 0.97,
  "min_level_db": -100,
  "ref_level_db": 20,
  "rescaling": false,
  "rescaling_max": 0.999,
  "allow_clipping_in_normalization": true,
  "downsample_step": 4,
  "outputs_per_step": 1,
  "embedding_weight_std": 0.1,
  "speaker_embedding_weight_std": 0.01,
  "padding_idx": 0,
  "max_positions": 1024,
  "dropout": 0.050000000000000044,
  "kernel_size": 3,
  "text_embed_dim": 256,
  "encoder_channels": 512,
  "decoder_channels": 256,
  "converter_channels": 512,
  "query_position_rate": 1.0,
  "key_position_rate": 1.385,
  "key_projection": true,
  "value_projection": true,
  "use_memory_mask": true,
  "trainable_positional_encodings": false,
  "freeze_embedding": false,
  "use_decoder_state_for_postnet_input": true,
  "pin_memory": true,
  "num_workers": 2,
  "masked_loss_weight": 0.5,
  "priority_freq": 3000,
  "priority_freq_weight": 0.0,
  "binary_divergence_weight": 0.1,
  "use_guided_attention": true,
  "guided_attention_sigma": 0.2,
  "batch_size": 16,
  "adam_beta1": 0.5,
  "adam_beta2": 0.9,
  "adam_eps": 1e-06,
  "initial_learning_rate": 0.0005,
  "lr_schedule": "noam_learning_rate_decay",
  "lr_schedule_kwargs": {},
  "nepochs": 2500,
  "weight_decay": 0.0,
  "clip_thresh": 0.1,
  "checkpoint_interval": 10000,
  "eval_interval": 10000,
  "save_optimizer_state": true,
  "force_monotonic_attention": true,
  "window_ahead": 3,
  "window_backward": 1,
  "power": 1.4
}
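
For comparison, a few lines to diff a preset like this against the Nyanko one (file names assumed to match the repo's presets/ directory):

# Print every hyperparameter that differs between two preset files.
import json

with open("presets/deepvoice3_ljspeech.json") as f:
    a = json.load(f)
with open("presets/nyanko_ljspeech.json") as f:
    b = json.load(f)

for key in sorted(set(a) | set(b)):
    if a.get(key) != b.get(key):
        print(f"{key}: {a.get(key)!r} -> {b.get(key)!r}")
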
nmstoker commented 6 years ago

Hi @G-Wang - thanks a lot for your advice. Using Nyanko completely avoids the problem with the speech trailing off at the end. I need to experiment further for optimal results, but it isn't at all like it was before - the ends of sentences are much like any other part of the output. I really appreciate the advice you gave!

G-Wang commented 6 years ago

Hey, glad it worked for you. I also prefer the Nyanko build, as I think it converges to better-sounding voices faster. Cheers

nmstoker commented 6 years ago

Closing as switching to Nyanko solved my issue

prajwalkr commented 5 years ago

> Closing as switching to Nyanko solved my issue

I am training for another language and this thread helped me a lot. Could you please tell me how many iterations it took with Nyanko to get intelligible speech?

Also, I am training with a batch size of 32; I hope that is alright.

mrgloom commented 5 years ago

@G-Wang When should a forced aligner (https://github.com/lowerquality/gentle) be used? Is trimming silence not sufficient? https://github.com/r9y9/deepvoice3_pytorch/blob/master/vctk.py#L61
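
For what it's worth, trimming only removes leading/trailing low-energy audio, while a forced aligner also reports where the first and last aligned words are and flags words it couldn't find, which catches transcript/audio mismatches. A sketch against Gentle's HTTP API, assuming a Gentle server running locally on its default port and placeholder file names:

# Align a mono clip with a local Gentle server, then keep only the audio
# between the first and last successfully aligned words.
import requests
import soundfile as sf

with open("clip.wav", "rb") as audio, open("clip.txt", "rb") as transcript:
    resp = requests.post(
        "http://localhost:8765/transcriptions?async=false",
        files={"audio": audio, "transcript": transcript},
    )
words = [w for w in resp.json()["words"] if w.get("case") == "success"]

y, sr = sf.read("clip.wav")
start, end = words[0]["start"], words[-1]["end"]
sf.write("clip_aligned.wav", y[int(start * sr):int(end * sr)], sr)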