Hi everyone,
I'm quite new to voice synthesis, so sorry if this question has been answered before. I've produced a cleaned dataset of 2740 clips, each 1-10 seconds in length (22050 Hz, 16-bit mono), and ran the following command:
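For what it's worth, this is roughly how I verify the clip format before training (a minimal sketch using Python's standard `wave` module; the `wavs/` directory is a placeholder for wherever your clips live):

```python
import wave
from pathlib import Path

# Flag any clip that doesn't match the format described above:
# 22050 Hz sample rate, 16-bit (2-byte) samples, mono, 1-10 s long.
for path in sorted(Path("wavs").glob("*.wav")):  # "wavs/" is a placeholder
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
        if rate != 22050 or width != 2 or channels != 1 or not 1.0 <= duration <= 10.0:
            print(f"{path}: {rate} Hz, {width * 8}-bit, "
                  f"{channels} channel(s), {duration:.2f} s")
```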
```
python train.py --output_directory=outdir --log_directory=logdir -c tacotron2_statedict.pt --warm_start --hparams=training_files=train.txt,validation_files=test.txt,batch_size=32
```
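In case the problem is in my filelists rather than the audio, this is how I sanity-check `train.txt` and `test.txt` (assuming the pipe-delimited `path|transcript` line format the repo's bundled filelists use):

```python
from pathlib import Path

# Each filelist line should look like "path/to/clip.wav|transcript".
# Report lines with missing fields, missing audio files, or empty text.
for filelist in ("train.txt", "test.txt"):
    for i, line in enumerate(Path(filelist).read_text(encoding="utf-8").splitlines(), 1):
        parts = line.split("|")
        if len(parts) < 2 or not Path(parts[0]).is_file() or not parts[1].strip():
            print(f"{filelist}:{i}: suspicious entry: {line!r}")
```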
However, the results I've received are quite poor: the synthesized speech is barely recognizable as a human voice.

![results](https://user-images.githubusercontent.com/35925918/94602425-be850e00-028c-11eb-9fc3-66d8bd4c298f.png)
The final checkpoint I've used for inference (iteration 35000) has a loss of 0.1739 and a grad norm of 0.3044.
I'd appreciate any suggestions, as I haven't been able to track down any solutions I haven't already tried.