keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

441k pretrained LJSpeech model #264

Open · Ittiz opened this issue 5 years ago

Ittiz commented 5 years ago

So I downloaded the 441k pretrained model, and it sounds OK, though it has issues with short phrases, where it repeats itself. Anyway, I thought I'd train it some more to see if it would improve. So I downloaded the LJSpeech dataset you provided, but the model actually starts to sound worse (muffled, as if over an old cellphone) and tends to repeat itself more after the additional training. These are the hparams I used:

Audio:

num_mels=80, num_freq=1025, sample_rate=22050, frame_length_ms=50, frame_shift_ms=12.5, preemphasis=0.97, min_level_db=-100, ref_level_db=20,

Model:

outputs_per_step=5, embed_depth=256, prenet_depths=[256, 128], encoder_depth=256, postnet_depth=256, attention_depth=256, decoder_depth=256,

Training:

batch_size=32, adam_beta1=0.9, adam_beta2=0.999, initial_learning_rate=0.002, decay_learning_rate=True, use_cmudict=False, # Use CMUDict during training to learn pronunciation of ARPAbet phonemes

Eval:

max_iters=170, griffin_lim_iters=60, power=1.5, # Power to raise magnitudes to prior to Griffin-Lim
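
For context on those last two Eval settings: they control the Griffin-Lim reconstruction used at synthesis time. Below is a minimal sketch of the algorithm, assuming numpy/librosa rather than this repo's actual TensorFlow implementation; n_fft=2048 follows from num_freq=1025, and hop_length=275 is 12.5 ms at 22050 Hz, rounded down:

```python
import numpy as np
import librosa

def griffin_lim(magnitudes, n_iters=60, power=1.5, n_fft=2048, hop_length=275):
    """Invert a linear-magnitude spectrogram to audio via Griffin-Lim.

    `power` and `n_iters` mirror the power / griffin_lim_iters hparams above.
    """
    # Raise magnitudes before inversion; emphasizes spectral peaks.
    S = magnitudes ** power
    # Start from random phase and iteratively refine it.
    angles = np.exp(2j * np.pi * np.random.rand(*S.shape))
    for _ in range(n_iters):
        audio = librosa.istft(S * angles, hop_length=hop_length)
        rebuilt = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)
        angles = np.exp(1j * np.angle(rebuilt))
    return librosa.istft(S * angles, hop_length=hop_length)
```

Raising the magnitudes to power=1.5 before inversion tends to reduce Griffin-Lim's characteristic phasiness, which is why it's worth keeping when comparing checkpoints.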

Ittiz commented 5 years ago

Nothing? Could someone share the hparams used to train the pretrained model? I'm getting sick of my output sounding like a crappy recording from a mid-'90s video game.

keithito commented 5 years ago

Can you attach some examples, and the command line you're using to continue training?
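
(For readers following along: resuming from the released checkpoint with this repo usually looks something like the sketch below. The --base_dir, --restore_step, and --hparams flags are train.py's as I recall them, so verify against your checkout; the step number and path are illustrative.)

```bash
python3 train.py --base_dir=~/tacotron \
                 --restore_step=441000 \
                 --hparams="batch_size=32,initial_learning_rate=0.002"
```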

Ittiz commented 5 years ago

I'll see if I can get some, in the mean time I've also found bad data in the LJs dataset. When I was training on it I noticed an image that had a bad alignment long after it was properly aligned. So I looked up the corresponding training wav file and text only to find they weren't a complete match. So at the moment I'm having the computer generate an alignment for every file so I can find the files that don't match the text. Later I will release a corrected data set.
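
A rough sketch of that kind of automated check, assuming a caller-supplied transcribe() ASR hook (the hook, the 0.8 cutoff, and the wavs/ path are illustrative; the pipe-delimited metadata.csv layout is LJSpeech's actual format):

```python
import difflib

def normalize(text):
    # Lowercase and collapse whitespace so trivial diffs don't dominate.
    return " ".join(text.lower().split())

def flag_mismatched_clips(metadata_path, transcribe, threshold=0.8):
    """Return (clip_id, similarity) for clips whose recognized speech
    diverges from the listed transcript.

    `transcribe(wav_path) -> str` is a caller-supplied ASR hook
    (hypothetical here); `threshold` is an illustrative cutoff.
    """
    flagged = []
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            # LJSpeech metadata.csv: id|raw transcript|normalized transcript
            parts = line.rstrip("\n").split("|")
            clip_id, text = parts[0], parts[-1]
            hypothesis = transcribe(f"wavs/{clip_id}.wav")
            similarity = difflib.SequenceMatcher(
                None, normalize(text), normalize(hypothesis)).ratio()
            if similarity < threshold:
                flagged.append((clip_id, similarity))
    return flagged
```

Clips flagged this way still need a manual listen, since ASR errors on clean audio will produce some false positives.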