Open jd-3d opened 2 years ago
I haven't run into this problem myself, but you can try adding 0.5 seconds of silence when generating the pieces. This can happen when sound is generated from a spectrogram. Let me know if it helps!
Thank you @e0xextazy. Can you clarify how to add 0.5 seconds of silence during generation? I didn't see any setting for that in the API. I researched this a bit more and found that the problem does not occur with some voices but does with others. I think I found the culprit: if you listen to the reference audio clips for train_empire, you will hear that the audio is arbitrarily cut off at the end while the speaker is mid-sentence. I believe this propagates into generation, and there is a ~20-30% chance it will mimic this 'cut-off' on any audio clip.
I am going to see whether modifying the audio clips to have a smooth ending, without clipping a word, helps. If it does, this could be an easy way to improve the voices.
Ah that's a good finding! I will go through and clean up train_empire (and look for other voices that exhibit this).
Thank you @neonbjb! If you send me the cleaned up audio files for train_empire I can re-run my tests on my long paragraph and let you know if it fixes things.
Found three voices that had this problem: train_lescault, train_empire and train_mouse. Unfortunately, I could not find the source material for train_lescault, so I am not touching that one. The other two should be "fixed". Please let me know if that helps.
https://github.com/neonbjb/tortoise-tts/commit/550874cbece34afeec738e5fc99977eeb8585ec2
@neonbjb, I see you updated 2.mp3 from train_empire, but the updated clip is still cut off (not as badly as before, but still cut). 1.mp3 is also cut off on the last word. I re-ran my example with the updated file, but the problem is still there. Shouldn't the voice clips have a smooth transition to silence? I'll try to download an audio editor and see if I can massage these audio files.
> Unfortunately for train_lescault I could not find the source material so I am not touching that one.
I have located the source and recorded 5 new samples for this voice. Would uploading these clips help? Or do they have to be the precise clips?
@wavymulder Yep, saw your PR. Thank you very much!
@jd-3d I'm not sure we're talking about the same thing. I was specifically removing clips where the sentence gets cut off without even uttering the last word. 1.mp3 says "Under this holy sign, the peasants and burgers who were attached to the servitude of a glieb might escape from a haughty lord". Agreed that the word "lord" is slightly cut off, but most of the train_* voices have the same behavior. This is actually an artifact of the scripts I built to compile the training dataset. I guess I didn't leave enough leeway at the end in the clipping logic. Nevertheless, since most clips the model was trained on are formatted like this, I don't really think it would explain the word-clipping behavior for a single voice.
I would love to hear any results you can provide with adding some extra silence at the end of the voice clips to see if that is reflected in the output.
@neonbjb, yes that very slight cut off of the word 'lord' for example is what I was referring to. Thanks for explaining the clipping logic in your scripts, that makes sense (would be great to re-visit that if you ever re-re-train, since I know you are already re-training).
I went ahead and modified the train_empire voice clips and cut them at clean points and added 0.5 sec of silence at the ends (see waveforms at the bottom). Using the new voice clips made a small improvement to the 'cutoff' phenomenon although it was still there slightly. That could be due to the training data using slightly cut-off clips and may affect other voices, not sure yet.
However, there was a huge difference in the timing/pacing of the generated audio. For the same paragraph, the old generated audio was 122.7 seconds and the new audio is 123.9 seconds. That may not sound like much, but it had a big effect on the pauses between sentences. The new audio clip sounds much more natural, whereas the clip generated from the original audio files did not pause enough between sentences and at commas. This could be a useful bit of info for people looking for longer pauses between sentences.
[Waveform images of the train_empire voice clips: before and after]
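For anyone who wants to try the same experiment without an audio editor, here is a minimal sketch of appending 0.5 seconds of silence to a reference clip. It assumes the clip is an uncompressed PCM WAV file (the clips discussed above are .mp3 and would need converting first), and `append_silence` is a hypothetical helper name, not part of tortoise-tts. It does not find a clean cut point; that still has to be done by ear.

```python
import wave


def append_silence(src, dst, seconds=0.5):
    """Copy a PCM WAV from src to dst with `seconds` of silence appended.

    src and dst may be file paths or open binary file-like objects.
    Hypothetical helper for illustration; not part of tortoise-tts.
    """
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frames = reader.readframes(reader.getnframes())

    # Signed PCM (16-bit and wider) is silent at zero; 8-bit WAV is
    # unsigned, so its silent midpoint is 0x80.
    silent_byte = b"\x80" if params.sampwidth == 1 else b"\x00"
    n_silent_frames = int(round(seconds * params.framerate))
    silence = silent_byte * (n_silent_frames * params.sampwidth * params.nchannels)

    with wave.open(dst, "wb") as writer:
        writer.setparams(params)  # the frame count in the header is corrected on close
        writer.writeframes(frames + silence)
```

Usage would look like `append_silence("1.wav", "1_padded.wav")`, with the file names here being placeholders.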
Experiencing the same issue with a custom voice
When using read.py on a paragraph, the last word of each audio file is cut short by around 0.2-0.5 sec (sometimes only about half of the last word is spoken). Even in the combined wav file you can still clearly hear the cuts. I couldn't find any easy workaround. Is this a known issue? I was using the train_empire voice and the latest build. If anyone has a fix or workaround, please let me know.
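One reading of the "add 0.5 seconds of silence when generating pieces" suggestion is to pad each generated piece before the pieces are combined. The sketch below shows that idea only; it assumes each piece is a plain sequence of float samples at a known sample rate, and `join_with_silence` is a hypothetical helper, not something read.py or the tortoise-tts API provides. Padding like this changes the spacing between pieces and cannot restore a word that was already truncated during generation.

```python
def join_with_silence(pieces, sample_rate, pad_seconds=0.5):
    """Concatenate audio pieces, appending pad_seconds of silence to each.

    pieces: iterable of sequences of float samples (assumed mono).
    Hypothetical helper for illustration; not part of tortoise-tts.
    """
    pad = [0.0] * int(round(pad_seconds * sample_rate))
    combined = []
    for piece in pieces:
        combined.extend(piece)  # the generated audio for this piece
        combined.extend(pad)    # trailing silence before the next piece
    return combined
```

For example, two pieces of 2 and 1 samples at a toy sample rate of 4 Hz yield 7 samples, with 2 zero samples after each piece.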