jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Optimizing the 'Align' Feature for Accurate Audio-Text Synchronization #331

Open zxl777 opened 6 months ago

zxl777 commented 6 months ago

How can I ensure that the "Align" feature, which aligns plain text or tokens with audio at the word level, avoids occasional errors in its output? This feature is great because it produces final results that respect my edited line breaks, and I can import recognition results without word timestamps from various models, which is very flexible.

In practical tests, the output is sometimes perfect. However, sometimes a few sentences are missed, or the timestamps are out of order.

What input conditions should I pay attention to in order to ensure perfect output?

jianfch commented 6 months ago

See https://github.com/jianfch/stable-ts/issues/296#issuecomment-1891080576.

Another tip is to try smaller models such as base; you might find better results with those.

zxl777 commented 6 months ago

@jianfch I recently attempted to use load_model('small.en'), and the results were flawless. However, when I tried load_model('medium.en'), it consistently missed some sentences.

```python
import stable_whisper

# `audio` and `text` are the input audio and the pre-split transcript.
# model = stable_whisper.load_model('medium.en')
model = stable_whisper.load_model('small.en')
result = model.align(audio, text, language='en', original_split=True)
```
jianfch commented 6 months ago

The chosen alignment heads of medium.en might simply not be as reliable as those of small.en.

zxl777 commented 6 months ago

@jianfch My recent discovery is that using small.en sometimes results in the occasional duplication of sentences, whereas using medium.en can lead to missing sentences. Therefore, the issues with small.en are relatively minor.

It would be great if the 'Align' feature could be further optimized.

zxl777 commented 6 months ago

@jianfch I suspect that the issue of having an extra sentence or missing one in the aligned results is due to the characteristics of the model and occasional recognition errors.

To address this, could we use two models to align separately and then integrate them together to eliminate the errors? Since alignment is fast, taking only a few seconds, it's worth trying to align twice.

jianfch commented 6 months ago

> To address this, could we use two models to align separately and then integrate them together to eliminate the errors?

This could work in theory, but it requires a reliable way to autodetect the errors.
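
A rough sketch of what such detection could look like, assuming both alignments preserve the input's line splits so segments can be compared one-to-one. This is not part of stable-ts; the helper function, model pairing, and tolerance are illustrative, using only the load_model/align calls shown elsewhere in this thread:

```python
import stable_whisper

def flag_disagreements(audio, text, tolerance=0.5):
    """Align the same text with two models and flag segments whose
    timestamps disagree by more than `tolerance` seconds."""
    result_a = stable_whisper.load_model('small.en').align(
        audio, text, language='en', original_split=True)
    result_b = stable_whisper.load_model('base.en').align(
        audio, text, language='en', original_split=True)

    # If one model dropped or duplicated a line, the segment counts
    # differ, which is itself a strong error signal.
    if len(result_a.segments) != len(result_b.segments):
        return None

    suspects = []
    for seg_a, seg_b in zip(result_a.segments, result_b.segments):
        if (abs(seg_a.start - seg_b.start) > tolerance
                or abs(seg_a.end - seg_b.end) > tolerance):
            suspects.append((seg_a.start, seg_a.end, seg_a.text))
    return suspects
```

Segments flagged this way could then be re-aligned or reviewed manually rather than trusted blindly.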

jianfch commented 6 months ago

The extra_models option introduced in 5513609b33935192cc54432bd224abd04b535965 computes the timestamps from the average of all the extra_models and the main model.

```python
import stable_whisper

model = stable_whisper.load_model('base')
extra_models = [stable_whisper.load_model(name) for name in ['base.en', 'small', 'small.en', 'tiny', 'tiny.en']]
result = model.transcribe('audio.wav', extra_models=extra_models)
```
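
Averaging the timestamps across several checkpoints presumably washes out the per-model quirks discussed above (missed sentences with medium.en, duplicated ones with small.en), at the cost of loading multiple models.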
zxl777 commented 5 months ago

I ultimately found the reason for the issue: it was a problem with the initial text input, which contained extra sentences. When stable-ts performs alignment, it can sometimes correct such errors and sometimes it cannot.

However, as long as the source problem is resolved, the alignment is always correct.
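
Given that the root cause was extra sentences in the input text, a cheap pre-check is to transcribe the audio once and flag input lines that have no close match in the transcript before running align. The sketch below is hypothetical: difflib coverage is only a crude similarity measure, and the function name, model choice, and threshold are illustrative:

```python
import difflib
import stable_whisper

def find_suspect_lines(audio, text, threshold=0.6):
    """Flag lines of `text` that are poorly covered by a rough transcript,
    e.g. extra sentences that are not actually spoken in the audio."""
    transcript = stable_whisper.load_model('small.en').transcribe(audio).text.lower()
    suspects = []
    for line in filter(None, (l.strip() for l in text.splitlines())):
        matcher = difflib.SequenceMatcher(None, line.lower(), transcript)
        # Fraction of the line's characters found in matching blocks.
        coverage = sum(m.size for m in matcher.get_matching_blocks()) / len(line)
        if coverage < threshold:
            suspects.append(line)
    return suspects
```

Lines returned by such a check could be removed or corrected in the input before alignment, which is exactly the source-level fix that resolved this issue.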