keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

Loss Exploded After 110k step and Bad output waves #266

Closed Liujingxiu23 closed 5 years ago

Liujingxiu23 commented 5 years ago

Language: Mandarin Chinese. Dataset: the open dataset from Biaobei. Scripts: I revised preprocess.py and datafeeder.py to handle Chinese; all other scripts are the same as on the master branch.
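(The thread does not include the modified scripts. As a rough illustration only, a common way to adapt preprocessing for Mandarin is to convert the text to pinyin with tone numbers so the model sees a small, phoneme-like alphabet. The pypinyin package below is my assumption, not something the author confirmed.)

```python
# Minimal sketch of Mandarin text normalization for Tacotron input.
# Assumes the third-party pypinyin package; the author's actual
# preprocess.py/datafeeder.py changes were not shared.
from pypinyin import lazy_pinyin, Style

def mandarin_to_pinyin(text):
    # Convert each character to pinyin with a trailing tone number,
    # e.g. "你好" -> "ni3 hao3".
    return ' '.join(lazy_pinyin(text, style=Style.TONE3))

print(mandarin_to_pinyin('你好'))  # ni3 hao3
```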

Loss:

[screenshot: training loss curve]

train.log:

Step 113790 [0.869 sec/step, loss=0.08736, avg_loss=0.08593]
Step 113791 [0.874 sec/step, loss=0.08801, avg_loss=0.08595]
Step 113792 [0.878 sec/step, loss=0.08830, avg_loss=0.08599]
Step 113793 [0.881 sec/step, loss=0.08872, avg_loss=0.08600]
Step 113794 [0.882 sec/step, loss=0.08842, avg_loss=0.08601]
Step 113795 [0.884 sec/step, loss=0.09743, avg_loss=0.08612]
Step 113796 [0.880 sec/step, loss=nan, avg_loss=nan]
Loss exploded to nan at step 113796! Exiting due to exception: Loss Exploded
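(The "Exiting due to exception" line comes from a guard in the training loop that aborts as soon as the loss turns NaN or blows up, since every later step would train on garbage gradients. A paraphrased sketch of that pattern, not the exact source:)

```python
import math

def check_loss(loss, step):
    # Paraphrase of the style of guard that produces the log line above
    # (not verbatim from this repo's train.py).
    if loss > 100 or math.isnan(loss):
        raise Exception('Loss exploded to %s at step %d!' % (loss, step))

check_loss(0.08612, 113795)           # fine
# check_loss(float('nan'), 113796)    # raises: Loss exploded to nan at step 113796!
```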

Output: The waves synthesized during training all sound good. Some of the waves synthesized with eval.py sound good too, but some (4 out of 10) are just noise. For those waves, the input text is normal, not long sentences.

[screenshot: eval.py output waveforms]

My questions:

  1. Why did "Loss Exploded" happen? Is it overfitting? If so, how can I judge which step is best? Only by the quality of the output waves?
  2. For the Biaobei dataset, does the loss look reasonable? Does anyone have experience with it?
  3. Why do some output waves sound like noise?
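(No root cause was ever posted in this thread; the author only wrote "solved" below. A NaN this late in training is more commonly blamed on an occasional exploding gradient from an outlier batch than on overfitting. A standard TF 1.x mitigation, sketched here as a generic suggestion and not as the author's confirmed fix, is global-norm gradient clipping:)

```python
import tensorflow as tf  # TF 1.x API, matching this repo

# Placeholder loss so the snippet is self-contained; in train.py this
# would be the Tacotron model's loss tensor.
w = tf.get_variable('w', shape=[10])
loss = tf.reduce_mean(tf.square(w))

# Clip gradients by global norm so one bad batch cannot produce an
# update large enough to push the loss to NaN.
optimizer = tf.train.AdamOptimizer(learning_rate=0.002, beta1=0.9, beta2=0.999)
grads, variables = zip(*optimizer.compute_gradients(loss))
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
train_op = optimizer.apply_gradients(list(zip(clipped_grads, variables)))
```

The other common recovery is to restore the last good checkpoint and continue training with a lower learning rate.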
Liujingxiu23 commented 5 years ago

solved

xxoospring commented 5 years ago

I met the same problem. Could you tell me the reason?

YoungofNUAA commented 5 years ago

Could you please share your training code and model for the Chinese dataset? Much appreciated.

gdineshk6174 commented 4 years ago

@Liujingxiu23 Hi, can you tell me how you solved the "loss exploded" problem? Thank you.

yilmazay74 commented 4 years ago

Hi all, I got the same error message at the 115k'th step. The output waves sound like nonsense and are very short. My training set consists of only 180 audio files; the audios' sample rate is 16 kHz and the bit depth is 16 bits. The longest audio is 20 seconds. The language of my training set is Turkish. I made a few adjustments in hparams.py to get rid of an input-data shape-mismatch error. My changes are:

| HParam name | Original value | Changed value |
| --- | --- | --- |
| cleaners | english_cleaners | transliteration_cleaners |
| sample_rate | 20000 | 16000 |
| frame_length_ms | 50 | 100 |
| frame_shift_ms | 12.5 | 25 |
| max_iters | 200 | 400 |
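(As a sanity check on these values: my arithmetic, not from the thread. The millisecond settings translate into STFT window and hop sizes in samples, and the longest clip fixes how many decoder iterations are needed.)

```python
# Arithmetic implied by the hparams above (my calculation, not from the thread).
sample_rate = 16000
frame_shift_ms = 25
frame_length_ms = 100
outputs_per_step = 5   # frames emitted per decoder step (r)
longest_clip_s = 20

hop_length = sample_rate * frame_shift_ms // 1000    # 400 samples
win_length = sample_rate * frame_length_ms // 1000   # 1600 samples

n_frames = longest_clip_s * 1000 // frame_shift_ms   # 800 spectrogram frames
decoder_steps = n_frames // outputs_per_step         # 160 decoder steps

print(hop_length, win_length, n_frames, decoder_steps)
# -> 400 1600 800 160, i.e. the longest clip needs about 160 decoder
#    steps at these settings, comfortably under max_iters=400.
```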

Below are all the params:

cleaners='transliteration_cleaners',

Audio:

num_mels=80,
num_freq=1025,
sample_rate=16000,
frame_length_ms=100,
frame_shift_ms=25,
preemphasis=0.97,
min_level_db=-100,
ref_level_db=20,

Model:

outputs_per_step=5,
embed_depth=256,
prenet_depths=[256, 128],
encoder_depth=256,
postnet_depth=256,
attention_depth=256,
decoder_depth=256,
epochs=100,

Training:

batch_size=32,
adam_beta1=0.9,
adam_beta2=0.999,
initial_learning_rate=0.002,
decay_learning_rate=True,
use_cmudict=False,

Eval:

max_iters=200,
griffin_lim_iters=60,
power=1.5,
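(For context on the Eval block, a generic sketch of the standard Griffin-Lim recipe, not the repo's exact audio.py: power raises the predicted magnitude before inversion to sharpen the spectrogram and reduce artifacts, and griffin_lim_iters sets the number of phase-estimation iterations. With num_freq=1025 the magnitude has 1025 rows, i.e. an FFT size of 2048.)

```python
import numpy as np
import librosa

def griffin_lim(magnitude, n_fft=2048, hop_length=400, win_length=1600,
                n_iters=60, power=1.5):
    """Minimal Griffin-Lim sketch. `n_iters` maps to griffin_lim_iters
    and `power` to the power hparam above; details may differ from the
    repo's audio.py."""
    S = magnitude ** power
    # Start from random phase and iteratively make it consistent with S.
    angles = np.exp(2j * np.pi * np.random.rand(*S.shape))
    for _ in range(n_iters):
        y = librosa.istft(S * angles, hop_length=hop_length, win_length=win_length)
        rebuilt = librosa.stft(y, n_fft=n_fft, hop_length=hop_length,
                               win_length=win_length)
        angles = np.exp(1j * np.angle(rebuilt))
    return librosa.istft(S * angles, hop_length=hop_length, win_length=win_length)
```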

I know 180 files is too few for training; however, I was expecting the training to at least finish without problems and produce a less accurate model. Before this, I trained a model with 40 files; that run ended at the 71,000'th step and can at least synthesize the texts in the training set. Can someone shed some light on the possible cause of this error? Secondly, how do the parameter values look? Any ideas on which tunable parameters to vary to improve accuracy? Thirdly, on a Tesla K80 GPU the training took about 2 days; is it possible to shorten the overall training time? I am sharing my train.log file for your reference. I will appreciate any help or recommendations. Best regards. train.log