keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

Poor Alignment for 2 hours of audio over 500 epochs / batch size 32 #352

Closed BenAAndrew closed 3 years ago

BenAAndrew commented 3 years ago

Hi everyone,

I'm quite new to voice synthesis, so sorry if this is a question that has been answered before. I've produced a cleaned dataset of 2,740 clips between 1 and 10 seconds in length (22050 Hz, 16-bit mono) and run the following command:

python train.py --output_directory=outdir --log_directory=logdir -c tacotron2_statedict.pt --warm_start --hparams=training_files=train.txt,validation_files=test.txt,batch_size=32
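
For reference, here is a minimal sanity check I could run over the dataset and filelist. It assumes the NVIDIA tacotron2 filelist format of one `path|transcript` pair per line (the `training_files=train.txt` hparam suggests that codebase); `train.txt` comes from the command above, and the expected-value constants just restate the dataset properties described earlier:

```python
# Sanity-check the filelist and the audio properties described above.
# Assumes the NVIDIA tacotron2 filelist format: one "path|transcript" per line.
import wave

EXPECTED_RATE = 22050      # 22050 Hz, as in the dataset description
EXPECTED_WIDTH = 2         # 16-bit audio = 2 bytes per sample
EXPECTED_CHANNELS = 1      # mono

with open("train.txt", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        path, _, transcript = line.rstrip("\n").partition("|")
        if not transcript:
            print(f"line {line_no}: missing '|' separator")
            continue
        with wave.open(path, "rb") as w:
            rate = w.getframerate()
            width = w.getsampwidth()
            channels = w.getnchannels()
            duration = w.getnframes() / rate
        if rate != EXPECTED_RATE or width != EXPECTED_WIDTH or channels != EXPECTED_CHANNELS:
            print(f"{path}: {rate} Hz, {8 * width}-bit, {channels} ch")
        if not 1.0 <= duration <= 10.0:
            print(f"{path}: {duration:.2f} s is outside the 1-10 s range")
```

A single clip with the wrong sample rate or channel count can silently corrupt training, so this kind of check is cheap insurance before ruling out data problems.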

However, the results I've received are quite poor, with the produced speech almost unrecognizable as a human voice (results image attached).

The final checkpoint I've used for inference (iteration 35000) has a loss of 0.1739 and a grad norm of 0.3044.
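
Since the title mentions poor alignment, plotting the attention matrix from that checkpoint would confirm whether attention ever learned to align text with audio. Below is a minimal sketch, assuming the NVIDIA tacotron2 codebase that the warm-start command points to (`load_model` in train.py, `create_hparams` in hparams.py, `text_to_sequence` in the text package, with `inference` returning the alignments last); the checkpoint path and test sentence are illustrative:

```python
# Hedged sketch: visualize the attention alignment for one test sentence.
# Assumes the NVIDIA tacotron2 repo is on the path (train.py, hparams.py, text/);
# the checkpoint filename and sentence below are illustrative.
import numpy as np
import torch
import matplotlib.pyplot as plt

from hparams import create_hparams
from train import load_model
from text import text_to_sequence

hparams = create_hparams()
model = load_model(hparams)
model.load_state_dict(torch.load("outdir/checkpoint_35000")["state_dict"])
model.cuda().eval()

text = "The quick brown fox jumps over the lazy dog."
sequence = np.array(text_to_sequence(text, ["english_cleaners"]))[None, :]
sequence = torch.from_numpy(sequence).cuda().long()

with torch.no_grad():
    # inference returns (mel, mel_postnet, gate, alignments) in that repo
    _, _, _, alignments = model.inference(sequence)

plt.imshow(alignments[0].float().cpu().numpy().T,
           aspect="auto", origin="lower", interpolation="none")
plt.xlabel("Decoder timestep")
plt.ylabel("Encoder timestep (text)")
plt.savefig("alignment_35000.png")
```

A healthy model shows a roughly diagonal band; a flat or blurry map means attention never aligned, which would match unintelligible audio even at a low training loss.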

I'd appreciate any suggestions, as I haven't been able to track down any solutions I haven't already tried.