Thank you for the detailed report. If everything goes well, you should get reasonable results with 200k steps, so there must be something wrong in your case. I would like to investigate the problem, but I am currently on a business trip and won't have time for a week. I believe I didn't change anything performance-critical, but I might have missed something important... In my experience, masked_loss_weight has not been a sensitive parameter.
@timbrucks : Same problem here. Maybe it is related to #38 ?
Could anyone try to revert https://github.com/r9y9/deepvoice3_pytorch/commit/421e8b72b975ea285283e852a47ae87a329db51f and see if it works?
I have reverted to 421e8b7 and started a training run. I will let you know what happens.
Hi, this is not relevant to the problem you have, but I tried to synthesize and it failed. I used the same instructions as in https://github.com/r9y9/deepvoice3_pytorch/tree/43579764f35de6b8bac2b18b52a06e4e11b705b2
But I get an error. Please help.
Looks like a dup of #37. Also, please do not comment on an unrelated issue. Create a new one if necessary.
Oops, I'm new to GitHub. That issue is the same as mine. Thanks for the support :)
@r9y9 : I also reverted to 421e8b7 and started training... will let you know ASAP... No change, eval still doesn't deliver any results (see alignment).
Training alignment (averaged):
Eval alignment at the same step (60,000 steps):
I allowed my training process for 421e8b7 to run up to 400k iterations. The sample WAV files produced in the checkpoints folder are pretty good (a little more reverb / metallic sound than expected) and the alignment is pretty good as well. Here is the alignment (from the "alignment_ave" dir) at 300k iterations for the utterance "that an anti-Castro organization had maintained offices there for a period ending early in 1962"
Here is a zip of the predicted WAV file at 300k iterations:
step000300000_predicted.wav.zip
However, if I use the stored checkpoint to synthesize that same utterance, the alignment looks like this and the WAV file produced is unintelligible.
So at this point I am wondering if the issue is something in the synthesis process ...
Hi, @homink, I am wondering if you hit the issue while working on #44. Could you give us some insight?
As an additional experiment, I reset my git checkout to 4357976 and attempted to train the LJSpeech model. I stopped the training at ~370K iterations. I see behavior similar to what I mentioned above: the predicted WAV files sound good, but the synthesized result is much lower quality.
This is the command I used, which, based on looking at hparams.py for that specific commit, seems like the correct way to get the preset parameters for LJSpeech:
```
python train.py --hparams "use_preset=True,builder=deepvoice3" --data-root <path to the ljspeech feats produced by preprocess.py> --checkpoint-dir <path checkpoint dir> --log-event-path <path log dir>
```
I again suspected something was amiss with the synthesis process. But with the same environment / setup, when I synthesize using the pre-trained LJSpeech model, the output sounds great - just like the samples provided.
At this point, all I can think of is that somehow I hosed up the pre-processing of the LJSpeech data ... maybe I ran pre-processing when I had the git pointed at a later commit? So I reran the preprocess step and have just started another training run using commit 4357976.
Any other ideas?
Hi All,
I believe the quality drop will be really tough to figure out. I suspect two things could cause such degradation: the random seed and silence in the audio. Regarding the random seed, if I checked correctly, there is no random seed initialization, so each run will be seeded differently and yield different training outputs. Silence at the beginning and end of the audio could also affect model training. Here is what I have experienced so far.
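(For reference, a minimal sketch of what explicit seed initialization could look like in PyTorch; the `set_seed` helper and the value 1234 are illustrative and not part of this repository.)

```python
import random

import numpy as np
import torch


def set_seed(seed=1234):
    """Seed the Python, NumPy, and PyTorch RNGs so repeated runs start identically."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op when CUDA is unavailable
```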
I have only tried JSUT and NIKL, using two commits: aeed2 and ed38d. I only checked the predicted WAV files produced during training. These WAV files gradually sounded better as the iterations went by. For JSUT, most of them sounded good at high iteration counts, around 370K. But for NIKL, only a few sounded good even at around 400K iterations. I assumed this was a matter of parameter tuning and didn't investigate further. Also, audio left untrimmed at the beginning and end, especially in NIKL, sounded really bad even at high iteration counts. So I trimmed the silence, leaving a margin of about 100 ms at the beginning and end. Details are here.
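(As an illustration of that kind of trimming, here is a minimal sketch using librosa.effects.trim; the `top_db` threshold of 30 and the helper name `trim_silence` are assumptions, not the exact settings described above.)

```python
import librosa


def trim_silence(wav, sr, top_db=30, margin_ms=100):
    """Trim leading/trailing silence but keep ~100 ms of margin on both ends."""
    margin = int(sr * margin_ms / 1000)
    _, (start, end) = librosa.effects.trim(wav, top_db=top_db)
    start = max(0, start - margin)
    end = min(len(wav), end + margin)
    return wav[start:end]
```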
I was able to successfully train a model using 421e8b7. My first attempt (described above) did not work, but after rerunning the pre-processing the results are much better, very similar to the sample results. Somehow I must have hosed up the pre-processing step (not sure how?).
I will now try to train using the recent commit 48d1014 (dated Feb 7).
Finally, I'm back. I will do some experiments locally soon. At the moment I am guessing that https://github.com/r9y9/wavenet_vocoder/issues/22 is the bug I have to fix, which was introduced in https://github.com/r9y9/deepvoice3_pytorch/commit/421e8b72b975ea285283e852a47ae87a329db51f.
The results of training with 48d1014 (now that I have good pre-processed data) were better. One of the synthesized utterances is bad, but several were good. I will give this latest commit a try.
OK, now I can reproduce. Looking into it...
Can confirm the bug with 2987b76: sound quality after 210k iterations on LJSpeech is far from the examples at r9y9.github.io/deepvoice3_pytorch.
Will 18bd61d fix it? Were VCTK and Nyanko models affected too?
No, the bug persists. I'm looking into it.
A little progress:
```diff
diff --git a/audio.py b/audio.py
index 0decdbc..53fa56c 100644
--- a/audio.py
+++ b/audio.py
@@ -45,7 +45,7 @@ def inv_spectrogram(spectrogram):
 def melspectrogram(y):
     D = _lws_processor().stft(preemphasis(y)).T
-    S = _amp_to_db(_linear_to_mel(np.abs(D))) - hparams.ref_level_db
+    S = _amp_to_db(_linear_to_mel(np.abs(D)))  # - hparams.ref_level_db
     if not hparams.allow_clipping_in_normalization:
         assert S.max() <= 0 and S.min() - hparams.min_level_db >= 0
     return _normalize(S)
@@ -69,18 +69,15 @@ def _linear_to_mel(spectrogram):
 def _build_mel_basis():
-    assert hparams.fmax <= hparams.sample_rate // 2
-    return librosa.filters.mel(hparams.sample_rate, hparams.fft_size,
-                               fmin=hparams.fmin, fmax=hparams.fmax,
-                               n_mels=hparams.num_mels)
+    return librosa.filters.mel(hparams.sample_rate, hparams.fft_size, n_mels=hparams.num_mels)
 def _amp_to_db(x):
-    return 20 * np.log10(x + 0.01)
+    return 20 * np.log10(np.maximum(1e-5, x))
 def _db_to_amp(x):
-    return np.maximum(np.power(10.0, x * 0.05) - 0.01, 0.0)
+    return np.power(10.0, x * 0.05)
```
With this I can get reasonable quality after 100k steps. Will look into it further.
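(To illustrate why the _amp_to_db change in the diff above matters, here is a standalone comparison, not code from the repo: with the old +0.01 offset, quiet spectrogram bins all get squashed toward roughly -40 dB, while clipping at 1e-5 preserves low-level detail down to the -100 dB floor.)

```python
import numpy as np

def amp_to_db_old(x):
    return 20 * np.log10(x + 0.01)

def amp_to_db_new(x):
    return 20 * np.log10(np.maximum(1e-5, x))

# A quiet spectrogram bin with |S| = 1e-4:
print(amp_to_db_old(1e-4))  # ~ -39.9 dB: the +0.01 offset dominates quiet bins
print(amp_to_db_new(1e-4))  # -80.0 dB: low-level structure is preserved
```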
I think #46 should fix this. I'm training a model from scratch to confirm if it actually fixes the problem.
@r9y9 With Tacotron 2, I've been able to get reasonable quality within 20k iterations using the log of the clipped magnitudes instead of a representation with the data normalized to [0, 1].
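(A rough sketch of the two feature representations being contrasted, assuming typical values of min_level_db=-100 and ref_level_db=20; the function names and constants are illustrative and not taken from either codebase.)

```python
import numpy as np

def normalized_db_features(mag, min_level_db=-100.0, ref_level_db=20.0):
    # dB-scale magnitudes squashed into [0, 1]
    db = 20.0 * np.log10(np.maximum(1e-5, mag)) - ref_level_db
    return np.clip((db - min_level_db) / -min_level_db, 0.0, 1.0)

def log_clipped_features(mag, clip_val=1e-5):
    # raw log of the clipped magnitudes, with no [0, 1] normalization
    return np.log(np.maximum(clip_val, mag))
```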
Sorry for the bug. I think I fixed the problem. Feel free to reopen if you still see the bug.
Based on my tests, I think you fixed the problem. Thanks!
Thanks for your excellent implementation of Deep Voice 3. I am attempting to retrain a DeepVoice3 model using the LJSpeech data. My interest in training a new model is that I want to make some small model parameter changes in order to enable fine-tuning using some Spanish data that I have.
As a first step I tried to retrain the baseline model and I have run into some issues.
With my installation, I have been able to successfully synthesize using the pre-trained DeepVoice3 model with git commit 4357976 as your instructions indicate. That synthesized audio sounds very much like the samples linked from the instructions page.
However, I am trying to train now with the latest git commit (commit 48d1014, dated Feb 7). I am using the LJSpeech data set downloaded from the link you provided. I have run the pre-processing and training steps as indicated in your instructions. I am using the default preset parameters for deepvoice3_ljspeech.
I have let the training process run for a while. When I synthesize using the checkpoint saved at 210K iterations, the alignment is bad and the audio is very robotic and mostly unintelligible.
When I synthesize using the checkpoint saved at 700K iterations, the alignment is better (but not great); the audio is improved but still robotic and choppy.
I can post the synthesized wav files via dropbox if you are interested. I expected to have good alignment and audio at 210K iterations as that is what the pretrained model used.
Any ideas what has changed between git commits 4357976 and 48d1014 that could have caused this issue? When I diff the two commits, I see some changes in audio.py, some places where support for multi-voice has been added, and some other changes I do not yet understand. There are some additions to hparams.py, but I only noticed one difference: in the current commit, masked_loss_weight defaults to 0.5, but in the prior commit the default was 0.0.
I have just started a new training run with masked_loss_weight set to 0.0. In the meantime, do you have thoughts on anything else that might be causing the issues I am seeing?