The method used by stable-ts is autoregressive. So if a word is not aligned properly and occupies the time slot that the next word is supposed to be in, then that next word might take up the time slots of the word that follows. This can start a domino effect that leads to align() reaching the end of the audio while there are still words without alignment, but there are safeguards in place that prevent this by allowing words (whose time slots may have been taken by previous words) to have no duration.
A way to improve the accuracy of align() is to clean the text and audio: remove any content from the text that isn't part of what is actually spoken in the audio (e.g. comments, notes, etc.), and remove any sounds from the audio that aren't speech (e.g. music and noise), which can be done with arguments such as demucs=True and only_voice_freq=True.
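For example, a minimal sketch of what that could look like (the file names, language, and model size are placeholders; the text file should contain only the spoken script, and demucs=True requires the demucs package to be installed):

```python
import stable_whisper

model = stable_whisper.load_model('base')

# 'script.txt' should contain only the words actually spoken in the audio.
with open('script.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# demucs=True isolates the vocals; only_voice_freq=True keeps only typical
# speech frequencies -- both help when the audio contains music or noise.
result = model.align('audiobook.mp3', text, language='ja',
                     demucs=True, only_voice_freq=True)

result.to_srt_vtt('audiobook.srt')
```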
By default, align() will skip sections of the audio that have been detected to be silent/non-speech, but there can be false positives (especially when the audio is not clean), which means it may skip sections that contain speech. You can disable this skipping with nonspeech_skip=None, or make it less likely to skip by setting a high duration (e.g. nonspeech_skip=10) and using vad=True.
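Continuing the sketch above (same model and text; both calls are alternatives, not steps to run together):

```python
# Never skip any section, even if it is detected as non-speech.
result = model.align('audiobook.mp3', text, language='ja',
                     nonspeech_skip=None)

# Or only skip non-speech sections longer than 10 seconds, with VAD enabled
# to make the non-speech detection more reliable.
result = model.align('audiobook.mp3', text, language='ja',
                     nonspeech_skip=10, vad=True)
```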
To speed up alignment, you can use the quantized 8-bit model from faster-whisper with stable-ts by installing faster-whisper and loading it with stable_whisper.load_faster_whisper('large-v2', compute_type="int8_float16"). In the ideal case that the text is clean and the audio is mostly clean, fast_mode=True can work extremely well in terms of both speed and accuracy.
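A rough sketch of that setup, assuming the same text as above and that alignment is called the same way on the faster-whisper-backed model (requires `pip install faster-whisper`):

```python
import stable_whisper

# Quantized 8-bit faster-whisper backend for faster alignment.
model = stable_whisper.load_faster_whisper('large-v2', compute_type="int8_float16")

# fast_mode=True trades some robustness for speed; it works best when both
# the text and the audio are already clean.
result = model.align('audiobook.mp3', text, language='ja', fast_mode=True)
result.to_srt_vtt('audiobook.srt')
```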
Unfortunately, using demucs or only_voice_freq immediately spikes to over 20 GB of memory with the 19 hr audio file I'm testing. I'm guessing there's no way around that?
> Unfortunately, using demucs or only_voice_freq immediately spikes to over 20 GB of memory with the 19 hr audio file
The long duration of the audio causes the spikes because demucs and only_voice_freq are applied to the entire audio track. For now, the way to reduce the spikes is to split the audio into parts. There are plans for supporting continuous audio streams; once that is implemented, the duration shouldn't matter.
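A rough sketch of that workaround (paths and durations are placeholders, and `text_parts` stands in for the script already split to match each chunk, which is the hard part in practice):

```python
import subprocess
import stable_whisper

# Split the long file into ~2 hr parts so demucs/only_voice_freq never
# process the whole 19 hr track at once.
subprocess.run([
    'ffmpeg', '-i', 'audiobook.mp3', '-f', 'segment',
    '-segment_time', '7200', '-c', 'copy', 'part_%03d.mp3',
], check=True)

model = stable_whisper.load_model('base')

# text_parts: hypothetical list of script slices, one per audio part.
# The resulting subtitles also need their timestamps offset when recombined.
for i, part_text in enumerate(text_parts):
    result = model.align(f'part_{i:03d}.mp3', part_text, language='ja',
                         demucs=True, only_voice_freq=True)
    result.to_srt_vtt(f'part_{i:03d}.srt')
```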
Ok - I think I will have to wait until then, as splitting audiobooks and recombining them caused a drop in quality and some desynchronization issues that were challenging. Thanks as always for the explanations.
I do think there's a way to get better alignment with smaller models, given that I'm using a tiny model with a good (although complex) alignment algorithm and getting very accurate results.
In v1.0.0 of my tool, I am able to get accurate alignment using only a tiny model: https://github.com/kanjieater/SubPlease/tree/v1.0.0
I'm finding that with the built-in stable-ts align, unless I use a large model, it will fail to align large portions. But stable-ts using the whisper model to align still seems to be much more performant. I'm wondering if there's anything I can share or learn about align to make it more accurate, but also more performant.
Basically this script just does what stable-ts align's use case does: it takes a text file and matches it to audio to generate a subtitle.
https://github.com/kanjieater/SubPlease/blob/v1.0.0/align.py
I did not write the align code and find it a little difficult to understand, but it does a good job of finding the best match in the script and aligning it to the generated whisper sub.
My question is: is there any way to tune or enhance stable-ts so that I don't need my own align code? I'm not sure if it just comes down to needing a larger "match window" or looser fuzzy matching to make stable-ts as accurate.