Closed EtienneAb3d closed 8 months ago
A word will always use the end of the previous word as its start. Any gap between words is either due to punctuation occupying that gap or to timestamp adjustments based on the detected nonspeech_sections (i.e. the gap is added after the fact).
Since there is no punctuation before "Bien", it has to rely on the nonspeech/silence detection to detect the 1.5s of silence.
You can check which gaps have been detected by printing out result.nonspeech_sections.
If the 1.5s of silence is found in nonspeech_sections, then it could be a bug or unsatisfied requirements that prevent the adjustments.
If it is not found, it simply wasn't detected. In that case, you can try specifying a denoiser (e.g. denoiser="demucs") to remove any background noise, or use vad=True, which is a more robust nonspeech detection method.
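To make the check above concrete, here is a small sketch for finding detected nonspeech sections that fall inside a given word. It assumes each entry of result.nonspeech_sections is a dict with "start" and "end" keys in seconds; the helper name sections_within and the sample values are hypothetical and only there for illustration.

```python
def sections_within(sections, start, end):
    """Return the nonspeech sections that lie fully inside [start, end]."""
    return [s for s in sections if start <= s["start"] and s["end"] <= end]


# Hypothetical data shaped like result.nonspeech_sections (assumed format).
sections = [
    {"start": 3.10, "end": 3.90},
    {"start": 12.98, "end": 14.68},
]

# Look for silence inside a word that spans 12.64s to 15.02s.
inside = sections_within(sections, 12.64, 15.02)
print(inside)
```

If the list comes back empty for the word in question, the silence was not detected, and denoiser="demucs" or vad=True is the next thing to try.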
Before "Bien" there is a newline character '\n'. Is a newline treated as possible punctuation? That is nearly always the case with lyrics. Without it, the first word of nearly every line will start too early.
The silence is well detected, but not removed from the word range:
That section is detected but ignored because it does not meet the nonspeech_error threshold.
https://github.com/jianfch/stable-ts/blob/00ad4b45d314eadedc59b2ede7ab034ef14a5131/stable_whisper/alignment.py#L145-L146
"Bien" is from 12.64 to 15.02, but the detected nonspeech section, 12.98 to 14.68, falls inside the word range, so that section can't simply be removed. This part of the README goes into more depth: https://github.com/jianfch/stable-ts?tab=readme-ov-file#silence-suppression
But in short, you can set a higher value than the default 0.1 for nonspeech_error. The value you set essentially reflects how much you want to rely on the detected nonspeech_sections. Generally, high values work well with clean audio (i.e. audio with no sounds other than speech and with relatively consistent volume).
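As a rough illustration of why the default ignores this section, here is a simplified sketch of a relative-error check. It assumes the error is the part of the word left over on one side of the silence, divided by the silence duration; that is an assumption for illustration, not a copy of the code linked above, and the function name suppress_inner_silence is hypothetical.

```python
def suppress_inner_silence(word_start, word_end, silence_start, silence_end,
                           nonspeech_error=0.1):
    """Try to drop a silence that lies inside a word.

    Returns the (start, end) of the word, unchanged if neither leftover
    side is small enough relative to the silence (assumed heuristic).
    """
    silence_dur = silence_end - silence_start
    start_extra = silence_start - word_start  # word audio before the silence
    end_extra = word_end - silence_end        # word audio after the silence
    if silence_dur <= 0 or start_extra < 0 or end_extra < 0:
        return word_start, word_end  # silence is not inside the word
    if start_extra / silence_dur <= nonspeech_error:
        return silence_end, word_end  # move the start past the silence
    if end_extra / silence_dur <= nonspeech_error:
        return word_start, silence_start  # move the end before the silence
    return word_start, word_end


# "Bien": 0.34s of leftover on each side of a 1.70s silence, ratio about 0.2.
print(suppress_inner_silence(12.64, 15.02, 12.98, 14.68))
print(suppress_inner_silence(12.64, 15.02, 12.98, 14.68, nonspeech_error=0.3))
```

Under this assumption the ratio for "Bien" is about 0.2, so the default of 0.1 leaves the word untouched, while a higher value such as nonspeech_error=0.3 would move its start to 14.68.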
OK, but the problem doesn't occur when using transcribe(). So there's a difference somewhere that's certainly worth investigating.
Differences between word timestamps produced by transcribe() and align() are expected because they use different heuristics for aligning the words.
The gap in the result of transcribe() is likely because that segment begins with "Bien", and any combination of the following can produce this gap:
- gap_padding='...': the beginning of each segment is padded with '...' for alignment, which acts as an invisible punctuation mark that can occupy the short gap before the first word, if there is one.
- clamp_max(), which by default clips the start of the first word if its duration exceeds a threshold.

For align(), the input text is treated as one big segment, and it is only split into segments after alignment is complete. Since there is no gap, it will not split at "Bien". You can use original_split=True to split the result based on the "\n" in text.
https://github.com/jianfch/stable-ts/blob/00ad4b45d314eadedc59b2ede7ab034ef14a5131/stable_whisper/alignment.py#L92-L93
But even then, clamp_max() may not necessarily clip the start of "Bien" because its duration does not exceed the threshold.
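To illustrate the clipping idea, here is a minimal sketch that clips the start of a segment's first word when its duration exceeds a threshold. The threshold used here (a factor times the median word duration in the segment), the parameter name factor, and the timings are all assumptions for illustration; this is not the actual clamp_max() implementation.

```python
from statistics import median


def clamp_first_word_start(words, factor=2.5):
    """Clip the start of the first word if it is unusually long.

    words: list of (start, end) tuples for one segment, in seconds.
    The threshold is factor * median duration (an assumed heuristic).
    """
    durations = [end - start for start, end in words]
    max_dur = factor * median(durations)
    first_start, first_end = words[0]
    if first_end - first_start > max_dur:
        # Keep the end fixed and pull the start forward.
        first_start = round(first_end - max_dur, 3)
    return [(first_start, first_end)] + list(words[1:])


# Hypothetical segment: a 2.4s first word followed by 0.4s words.
long_first = [(0.0, 2.4), (2.4, 2.8), (2.8, 3.2), (3.2, 3.6), (3.6, 4.0)]
clamped = clamp_first_word_start(long_first)
print(clamped[0])

# Hypothetical segment where no word exceeds the threshold: nothing changes.
even = [(0.0, 0.5), (0.5, 1.0), (1.0, 1.5)]
print(clamp_first_word_start(even))
```

The second case mirrors the caveat above: if the first word's duration stays under the threshold, its start is left where it is, silence included.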
On a related note (same file I sent you), do you know why align() is not detecting the last 3 silences, which transcribe() detects correctly?
Thanks for reporting this bug. It should be fixed after 424f4842d91c0fceb66ac96aba43c41cb30275b3.
Generally, slight differences in the nonspeech_sections between align() and transcribe() are normal because the default nonspeech predictor makes localized predictions and then merges all of them when the task is complete. The localized predictions can vary depending on how transcribe()/align() slides across the audio, which is why the final nonspeech_sections can also vary.
Hi,
It seems align() is starting each word at the exact end of the previous one. This causes silences to be included in the words that follow them. In this example (French text), "Bien" is said to be a 2.4s word, but it includes at least 1.5s of silence at its beginning.
Possibly related: I also find that each word ends a bit too early, by perhaps 200ms.
(Tested with large-V2)