How do you prepare a long audio file for TTS? As I understand it, we need to cut the long audio into sentences. For example, https://americanrhetoric.com/barackobamaspeeches.htm, or audiobooks (as I understand it, the LJ Speech dataset, https://keithito.com/LJ-Speech-Dataset/, was originally an audiobook).
I used ffmpeg to slice the audio file into 10s clips. Then I ran DeepSpeech to transcribe each clip, and wrote a Python script using fuzzy string matching to find the "best match" in the book's text.
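A minimal sketch of that pipeline, assuming the DeepSpeech >=0.7 Python API. The stdlib `difflib` stands in for the unnamed fuzzy-matching library, and all file names, the model file, and the sentence splitting are placeholders, not from the original post:

```python
import difflib      # stdlib fuzzy matching; fuzzywuzzy/rapidfuzz would work similarly
import subprocess
import wave
from pathlib import Path

import numpy as np
import deepspeech   # assumes the DeepSpeech >=0.7 Python API

Path("clips").mkdir(exist_ok=True)

# 1. Slice the long recording into 10s clips, resampled to 16 kHz mono
#    (the format DeepSpeech expects).
subprocess.run([
    "ffmpeg", "-i", "audiobook.wav",   # placeholder input file
    "-f", "segment", "-segment_time", "10",
    "-ar", "16000", "-ac", "1",
    "clips/clip_%04d.wav",
], check=True)

# 2. Transcribe each clip.
model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")  # placeholder model file

def transcribe(path):
    with wave.open(str(path)) as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    return model.stt(audio)

# 3. Fuzzy-match each transcript against the book's sentences.
book_sentences = Path("book.txt").read_text().split(". ")  # naive sentence split

def best_match(transcript):
    return max(
        book_sentences,
        key=lambda s: difflib.SequenceMatcher(None, transcript.lower(), s.lower()).ratio(),
    )

for clip in sorted(Path("clips").glob("clip_*.wav")):
    print(clip.name, "->", best_match(transcribe(clip)))
```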
Can't say I've had much luck training, however. I'm down to a loss of 0.1585, and nothing I synthesize from the checkpoints is coherent. I wonder what loss I should be training toward.
As an update on what I've done, which has worked much better: instead of cutting the long audio into 10s segments, I split on silence using ffmpeg's silence detection. Then I run each segment through DeepSpeech and use my "best match" script to correct the text for each segment. The only problem now is that the beginning and end of each segment are cut right at the start of a word, but that's fixable.
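A sketch of that silence-based splitting, assuming ffmpeg's `silencedetect` filter. The noise threshold and minimum silence duration are illustrative, not the values used in the post; cutting at the midpoint of each detected silence is one way to keep the cuts away from word onsets:

```python
import re
import subprocess
from pathlib import Path

SRC = "audiobook.wav"  # placeholder input file
Path("segments").mkdir(exist_ok=True)

# 1. Detect silences; ffmpeg logs silence_start/silence_end lines to stderr.
probe = subprocess.run(
    ["ffmpeg", "-i", SRC, "-af", "silencedetect=noise=-35dB:d=0.3",
     "-f", "null", "-"],
    capture_output=True, text=True,
)
starts = [float(x) for x in re.findall(r"silence_start: ([\d.]+)", probe.stderr)]
ends = [float(x) for x in re.findall(r"silence_end: ([\d.]+)", probe.stderr)]

# 2. Cut at the midpoint of each silence so speech is less likely to be
#    clipped at the segment boundaries.
cut_points = [0.0] + [(s + e) / 2 for s, e in zip(starts, ends)]
for i, begin in enumerate(cut_points):
    end = cut_points[i + 1] if i + 1 < len(cut_points) else None
    cmd = ["ffmpeg", "-y", "-i", SRC, "-ss", f"{begin:.3f}"]
    if end is not None:
        cmd += ["-to", f"{end:.3f}"]
    cmd += ["-ar", "16000", "-ac", "1", f"segments/seg_{i:04d}.wav"]
    subprocess.run(cmd, check=True)
```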
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Closed: mrgloom closed this issue 5 years ago.