jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

align() is not removing silences before words #319

Closed EtienneAb3d closed 7 months ago

EtienneAb3d commented 7 months ago

Hi,

It seems align() is starting a word at the exact end of the previous one. This causes silences to be included in their following words.

In this example (French text), "Bien" is said to be a 2.4s word, but it's including at least 1.5s of silence at its beginning.

8
00:00:11.580 --> 00:00:12.640
univers

9
00:00:12.640 --> 00:00:15.020
Bien

10
00:00:15.020 --> 00:00:15.640
avant

Possibly related: each word also seems to end a bit too early, by roughly 200ms.

(Tested with large-v2.)

jianfch commented 7 months ago

A word always uses the end of the previous word as its start. Gaps between words come either from punctuation occupying the gap or from timestamp adjustments based on the detected nonspeech_sections (i.e. the gap is added after the fact).

Since there is no punctuation before "Bien", it has to rely on the nonspeech/silence detection to find the 1.5s of silence. You can check which gaps have been detected by printing result.nonspeech_sections. If the 1.5s of silence appears in nonspeech_sections, then either a bug or an unmet requirement is preventing the adjustment. If it does not appear, it simply wasn't detected. In that case, you can try specifying a denoiser (e.g. denoiser="demucs") to remove background noise, or use vad=True, which is a more robust nonspeech detection method.
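To check whether a detected gap falls inside a given word, the printed sections can be compared against the word's range. This is a minimal sketch, assuming each entry of result.nonspeech_sections is a dict with "start" and "end" in seconds (verify against what your version prints). The helper name find_overlaps and the sample values are made up for illustration.

```python
def find_overlaps(sections, word_start, word_end):
    """Return the nonspeech sections that overlap the range [word_start, word_end]."""
    return [
        s for s in sections
        if s["start"] < word_end and s["end"] > word_start
    ]


# Sample data for illustration only; in practice pass result.nonspeech_sections.
sections = [
    {"start": 3.10, "end": 3.90},
    {"start": 12.98, "end": 14.68},
    {"start": 20.00, "end": 21.50},
]

# "Bien" is reported as 12.64 -> 15.02 in the subtitle excerpt.
hits = find_overlaps(sections, 12.64, 15.02)
print(hits)
```

An empty list would suggest the silence was not detected at all, in which case a denoiser or vad=True is the thing to try. A non-empty list means the detection worked and the adjustment step is what skipped it.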

EtienneAb3d commented 7 months ago

Before "Bien" there is a newline character ('\n'). Is a newline treated as possible punctuation? That is nearly always the case with lyrics. Otherwise, the first word of almost every line will start too early.

The silence is correctly detected, but it is not removed from the word's range:

image

image

jianfch commented 7 months ago

That section is detected but ignored because it does not meet the nonspeech_error threshold. https://github.com/jianfch/stable-ts/blob/00ad4b45d314eadedc59b2ede7ab034ef14a5131/stable_whisper/alignment.py#L145-L146

"Bien" runs from 12.64 to 15.02, but the detected nonspeech section, 12.98 to 14.68, lies inside the word's range rather than touching either boundary, so the section can't simply be removed. This part of the readme goes into more depth: https://github.com/jianfch/stable-ts?tab=readme-ov-file#silence-suppression

But in short, you can set a value higher than the default 0.1 for nonspeech_error. The value you set essentially reflects how much you want to rely on the detected nonspeech_sections. Generally, high values work well with clean audio (i.e. audio with no sounds other than speech and with relatively consistent volume).
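The numbers for "Bien" can be worked through by hand. This is a rough sketch, assuming the error is measured as the distance from a word boundary to the silence, divided by the word's duration. The library's exact rule is in the linked source and may differ. The function names here are hypothetical.

```python
def relative_errors(word_start, word_end, sil_start, sil_end):
    """Offsets of a silence from each word boundary, relative to the word duration."""
    duration = word_end - word_start
    start_err = (sil_start - word_start) / duration
    end_err = (word_end - sil_end) / duration
    return start_err, end_err


def within_threshold(start_err, end_err, threshold):
    """Assumed rule: the silence is usable if either side is within the threshold."""
    return min(start_err, end_err) <= threshold


# "Bien": 12.64 -> 15.02, detected silence: 12.98 -> 14.68
start_err, end_err = relative_errors(12.64, 15.02, 12.98, 14.68)
print(round(start_err, 3), round(end_err, 3))    # both about 0.143

print(within_threshold(start_err, end_err, 0.1))   # default threshold
print(within_threshold(start_err, end_err, 0.15))  # slightly higher threshold
```

Under this assumption, the silence sits about 14% of the word's duration away from each boundary. That misses the default of 0.1, and a value such as nonspeech_error=0.15 or higher would be needed for the section to be used.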

EtienneAb3d commented 7 months ago

Ok, but, the problem doesn't occur when using transcribe(). So there's a difference somewhere that's certainly worth investigating.

image

jianfch commented 7 months ago

Differences between word timestamps produced by transcribe() and align() are expected because they use different heuristics for aligning the words.

The gap in the result of transcribe() is likely because that segment begins with "Bien" and a combination of any of the following can produce this gap:

For align(), the input text is treated as one big segment, and it is only split into segments after alignment is complete. Since there is no gap, it will not split at "Bien". You can use original_split=True to split the result based on the "\n" in the text. https://github.com/jianfch/stable-ts/blob/00ad4b45d314eadedc59b2ede7ab034ef14a5131/stable_whisper/alignment.py#L92-L93

But even then, clamp_max() may not necessarily clip the start of "Bien" because its duration does not exceed the threshold.
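To illustrate why splitting on "\n" does not move the start of "Bien" by itself, here is a hypothetical sketch of regrouping a flat list of aligned words by the lines of the input text. It is not the library's code. The function name, the word dicts, and the sample text are assumptions for illustration.

```python
def group_words_by_line(text, words):
    """Regroup aligned words into one segment per non-empty line of the input text."""
    segments = []
    index = 0
    for line in text.split("\n"):
        count = len(line.split())
        if count == 0:
            continue  # skip blank lines
        chunk = words[index:index + count]
        index += count
        segments.append({
            "start": chunk[0]["start"],
            "end": chunk[-1]["end"],
            "text": " ".join(w["word"] for w in chunk),
        })
    return segments


# Word timings taken from the subtitle excerpt; the line layout is illustrative.
words = [
    {"word": "univers", "start": 11.58, "end": 12.64},
    {"word": "Bien", "start": 12.64, "end": 15.02},
    {"word": "avant", "start": 15.02, "end": 15.64},
]
segments = group_words_by_line("univers\nBien avant", words)
for seg in segments:
    print(seg)
```

The second segment still starts at 12.64, because regrouping only changes which words belong together. Any change to the start time would have to come from a later adjustment such as the clamping step mentioned above.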

EtienneAb3d commented 7 months ago

Along the same lines, and with similar consequences (same file I sent you): do you know why align() is not detecting the last 3 silences, which transcribe() detects correctly?

image

jianfch commented 7 months ago

Thanks for reporting this bug. It should be fixed after 424f4842d91c0fceb66ac96aba43c41cb30275b3.

Generally, slight differences in nonspeech_sections between align() and transcribe() are normal because the default nonspeech predictor makes localized predictions and then merges all of them when the task is complete. The localized predictions can vary depending on how transcribe()/align() slides across the audio, which is why the final nonspeech_sections can also vary.
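The effect of merging localized predictions can be sketched with a plain interval merge. This is only an illustration of the idea, not the library's predictor. The function name and both sets of sample predictions are made up.

```python
def merge_sections(sections):
    """Merge overlapping or touching (start, end) intervals into a sorted list."""
    merged = []
    for start, end in sorted(sections):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]


# Two hypothetical passes over the same audio with different window positions.
pass_a = [(12.9, 13.8), (13.7, 14.7), (20.0, 21.0)]
pass_b = [(12.9, 13.6), (13.9, 14.7), (20.0, 21.0)]

merged_a = merge_sections(pass_a)
merged_b = merge_sections(pass_b)
print(merged_a)
print(merged_b)
```

In the first pass the two local predictions overlap and merge into one long section. In the second they stay separate, so the final list differs even though the audio is the same.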