r9y9 / deepvoice3_pytorch

PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models
https://r9y9.github.io/deepvoice3_pytorch/

How to prepare long audio file for TTS? #168

Closed mrgloom closed 5 years ago

mrgloom commented 5 years ago

How should I prepare a long audio file for TTS? As I understand it, the long audio needs to be cut into sentences. Examples would be the speeches at https://americanrhetoric.com/barackobamaspeeches.htm or audiobooks (as I understand it, the LJ Speech dataset was originally an audiobook: https://keithito.com/LJ-Speech-Dataset/).

tripzero commented 5 years ago

I used ffmpeg to slice the audio file into 10 s clips. Then I used deepspeech to get a transcript for each clip and wrote a script using python fuzzy to find the "best match" for that transcript in the text of the book.
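
For readers who want to try the "best match" step, here is a minimal sketch of one way it could work. It is not the script described above: it swaps in `difflib` from the Python standard library for whatever fuzzy-matching package was actually used, and the function name `best_match` and the `slack` parameter are made up for illustration. It slides a window of roughly the transcript's length over the book text and keeps the window with the highest similarity ratio.

```python
from difflib import SequenceMatcher


def best_match(transcript, book_text, slack=2):
    """Return (span, score) for the part of book_text most similar to transcript.

    transcript: the (possibly wrong) text produced by speech recognition.
    slack: hypothetical tolerance, in words, on the window length.
    """
    book_words = book_text.split()
    n = len(transcript.split())
    target = transcript.lower()
    best_span, best_score = "", 0.0
    for start in range(len(book_words)):
        for length in range(max(1, n - slack), n + slack + 1):
            window = book_words[start:start + length]
            if len(window) < length:
                break  # ran off the end of the book text
            candidate = " ".join(window)
            score = SequenceMatcher(None, target, candidate.lower()).ratio()
            if score > best_score:
                best_span, best_score = candidate, score
    return best_span, best_score


if __name__ == "__main__":
    book = "It was the best of times, it was the worst of times, it was the age of wisdom"
    span, score = best_match("it was the worst of time", book)
    print(span, round(score, 2))
```

This brute-force scan is quadratic in the worst case, so for a whole book it would probably help to search only near the position of the previous clip's match.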

Can't say I've had much luck training, however. I'm down to a loss of 0.1585, and nothing I synthesize with the checkpoints produces anything coherent. I wonder what loss I should be aiming for.

tripzero commented 5 years ago

An update on something that has worked much better: instead of cutting the long audio clip into 10 s segments, I cut it by silence detection in ffmpeg. I then run each segment through deepspeech and use my "best match" script to correct the text for each segment. The only problem now is that the beginning and end of each segment are cut right at the beginning of a word, but that's fixable.
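
A rough sketch of how the silence-based cutting might be scripted, in case it helps. It assumes ffmpeg's `silencedetect` audio filter, which logs `silence_start:` and `silence_end:` lines to stderr. The threshold values, the function names, and the `pad` parameter are illustrative assumptions rather than the settings used above. Padding each segment by a fraction of a second is one possible way to avoid cutting exactly at a word boundary.

```python
import re
import subprocess
from itertools import zip_longest

SILENCE_START = re.compile(r"silence_start:\s*(-?\d+(?:\.\d+)?)")
SILENCE_END = re.compile(r"silence_end:\s*(-?\d+(?:\.\d+)?)")


def run_silencedetect(path, noise_db=-30, min_silence=0.5):
    """Run ffmpeg's silencedetect filter and return its log (ffmpeg logs to stderr)."""
    cmd = ["ffmpeg", "-hide_banner", "-i", path,
           "-af", f"silencedetect=noise={noise_db}dB:d={min_silence}",
           "-f", "null", "-"]
    return subprocess.run(cmd, capture_output=True, text=True).stderr


def speech_segments(log, total_duration, pad=0.15):
    """Turn a silencedetect log into padded (start, end) speech segments in seconds."""
    starts = [float(x) for x in SILENCE_START.findall(log)]
    ends = [float(x) for x in SILENCE_END.findall(log)]
    segments, cursor = [], 0.0
    # A silence with no logged end is assumed to run to the end of the file.
    for s, e in zip_longest(starts, ends, fillvalue=total_duration):
        if s > cursor:
            segments.append((cursor, s))
        cursor = max(cursor, e)
    if cursor < total_duration:
        segments.append((cursor, total_duration))
    # Widen each segment slightly so words at the edges are not clipped.
    return [(max(0.0, a - pad), min(total_duration, b + pad)) for a, b in segments]


def cut_command(src, dst, start, end):
    """Build an ffmpeg command that writes src[start:end] to dst."""
    return ["ffmpeg", "-i", src, "-ss", str(start), "-to", str(end), dst]
```

Each `(start, end)` pair from `speech_segments` can be passed to `cut_command` and executed with `subprocess.run`. The total duration has to be supplied by the caller, for example from `ffprobe`.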
