jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper

Wrong subtitles shown during silence #380

Open Crypto90 opened 3 months ago

Crypto90 commented 3 months ago

I've run into a weird issue and can't find the cause (using the latest git version and model: large-v3). I generate translated subtitles from a Danish audio file. The word-level timings are pretty accurate, but in some cases it creates/imagines a "first word" where it should be silent, and this "first word" starts a new segment. In the following example, it imagines "It's" for the time 00:00:09,180 --> 00:00:09,600, which starts a new segment. Then for the whole duration of subtitle index 9 (00:00:09,600 --> 00:01:31,480), it shows the subtitle without any highlighted word, which is wrong.

So, in sum: how can I prevent this? I experimented a lot with the VAD settings but could not get this fixed. Any help is much appreciated!

8
00:00:09,180 --> 00:00:09,600
<font color="#1f4123">[zzz321]</font> <font color="#1f4123">It's</font> Jew, South Africa, Junk Jews.

9
00:00:09,600 --> 00:01:31,480
<font color="#1f4123">[zzz321]</font> It's Jew, South Africa, Junk Jews.

10
00:01:31,480 --> 00:01:31,720
<font color="#1f4123">[zzz321]</font> It's <font color="#1f4123">Jew,</font> South Africa, Junk Jews.
result = model.transcribe_stable(
    filename,
    task="translate",
    language="da",
    word_timestamps=True,
    vad_filter=True,
    vad_parameters=dict(threshold=0.8, min_silence_duration_ms=800, min_speech_duration_ms=300),
    suppress_silence=True,
    best_of=10,
    beam_size=10,
    min_word_dur=0.1,
    no_speech_threshold=0.8,
    vad_threshold=0.8,
    condition_on_previous_text=False,
    suppress_tokens=[],
    regroup=False,
    use_word_position=True,
)


jianfch commented 3 months ago

The "It's" is the result of the chosen model and settings. Changing the model or the settings (e.g. reducing beam_size and best_of) can produce overall different translation which will not have the first "It's". The other settings that affect the translation for Faster-Whisper models are vad_filter and vad_parameters. When vad_filter=True, Faster-Whisper will only translated the portions of the audio that contains voice as determined by the VAD. As a result, the vad_parameters will affect which portions of the audio gets translated and potentially produce different translations.

The audio is translated, then the word timestamps are computed after the fact. With the less reliable word timestamps of translations, we can end up with an edge case: a word starts before a silent section but also ends before that silent section ends, so the next word starts before the end of the silent section as well. This causes use_word_position=True to fail because it treats the silence as if it were in between the two words instead of before the first word. So something like this likely happened:

original timestamps of "It's": 00:00:09,180 --> 00:01:30,000
original timestamps of "Jew": 00:01:30,000 --> 00:01:31,720
silent section detected by VAD: 00:00:09,600 --> 00:01:31,480

That line is shown from 00:00:09,600 to 00:01:31,480 because gaps within a segment only cause words to not be highlighted during those gaps. Only gaps between segments are not shown; otherwise the small gaps between words would cause the whole segment to flicker in and out constantly. By default, segments with gaps larger than 0.5 seconds are split into smaller segments because regroup=True. However, you used regroup=False, so the large gap from 00:00:09,600 to 00:01:31,480 did not cause "It's" to be split off into its own segment. If you want to split at gaps like regroup=True does, but without its other regrouping steps, you can use regroup="sg=0.5".
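
For example, a minimal sketch based on your call above (all other parameters kept as you had them):

result = model.transcribe_stable(
    filename,
    task="translate",
    language="da",
    word_timestamps=True,
    vad_filter=True,
    regroup="sg=0.5",  # split segments only at gaps larger than 0.5 seconds
)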

Crypto90 commented 3 months ago

The split by gap helps with that point: I then get two subtitle phrases.

"It's": 00:00:09,180 --> 00:01:30,000
-- silence, nothing shown --
"Jew": 00:01:30,000 --> 00:01:31,720

But the "It's" still has the wrong time: it should be part of (or directly before) 00:01:31,480 --> 00:01:31,720, not before the long silent pause.

I have a similar second case, with split by gap set to 4 seconds, since a value like 0.5 or 1.0 breaks basically every sentence apart, which I want to prevent. I want to keep the sentences untouched as they come.

I get a result like this:

12
00:00:07,800 --> 00:00:08,260
<font color="#7767a3">[alexxxx]</font> <font color="#7767a3">Are</font>

13
00:01:13,540 --> 00:01:14,100
<font color="#7767a3">[alexxxx]</font> <font color="#7767a3">we</font> ready?

14
00:01:14,100 --> 00:01:14,300
<font color="#7767a3">[alexxxx]</font> we <font color="#7767a3">ready?</font>

So the main issue here is the first "Are", which is again WAY off. The person says "are we ready?" but starts speaking at around 00:01:13,540, not at 00:00:07,800 --> 00:00:08,260, which is totally wrong.

So there is some issue here with the timestamp of the first word. It only happens sometimes, but it happens.

I think it happens in combination with VAD being enabled. So somehow, sometimes the first word's position ends up BEFORE the silence (wrong) instead of after the silence (correct).

Crypto90 commented 3 months ago

I just ran another test, now with

vad_filter=False

With vad_filter disabled, the subtitles are correct in time, but with tons of hallucinations. The "Are" is not "way off" like before with vad_filter=True. So there is an issue with positioning the first word of a phrase before the detected VAD silence instead of correctly positioning it after the silence.

28
00:01:12,140 --> 00:01:12,340
<font color="#7767a3">[alexxxx]</font> <font color="#7767a3">.org community Are</font> we ready, Havaa?cze

29
00:01:12,340 --> 00:01:14,120
<font color="#7767a3">[alexxxx]</font> .org community Are <font color="#7767a3">we</font> ready, Havaa?cze

30
00:01:14,120 --> 00:01:14,300
<font color="#7767a3">[alexxxx]</font> .org community Are we <font color="#7767a3">ready,</font>

jianfch commented 3 months ago

Word-level timing for translations is prone to these issues because the timestamp adjustment step requires reliable timestamps. I'd suggest disabling the adjustment step with suppress_silence=False, or working only with segment-level timestamps with word_timestamps=False.
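
For example (minimal sketches; everything else unchanged from your call):

# keep word timestamps but skip the silence-based timestamp adjustment
result = model.transcribe_stable(filename, task="translate", language="da", suppress_silence=False)

# or skip word-level timing entirely and keep only segment-level timestamps
result = model.transcribe_stable(filename, task="translate", language="da", word_timestamps=False)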

Crypto90 commented 3 months ago

Using suppress_silence=False makes no difference in this case with translate.

As long as vad_filter=True is set, split segments like:

I
have it, I have it.
...
The
other one was B

get produced instead of:

I have it, I have it.
...
The other one was B

I tested all kinds of parameters, and the only parameter that creates this issue is vad_filter=True, no matter what suppress_silence, suppress_word_ts, vad, and use_word_position are set to.

By enabling vad_filter, the time between "I" and "have" increases A LOT. I run into cases where 10-20 seconds end up in between (see the example in my next comment).

I also looked through the code; there is no real logic to prevent this case.

My use case: right now I am using the following regrouping:

result.clamp_max()  # clamp abnormally long word durations
result.split_by_punctuation(['.', '。', '?', '!', '?'])  # split segments at sentence-ending punctuation
result.split_by_gap(4.0)  # split segments at gaps longer than 4 seconds

I need a way to prevent this issue for the "translate" results.

I also tried locking the first word of each segment, with mixed good and bad results, so up to now there is no real solution with lock() (which would be no solution anyway, since the wrong first-word timing caused by vad_filter=True would still exist).
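
For reference, a rough sketch of the kind of locking attempted (the parameters here are assumptions about stable-ts's lock() regrouping method, not the exact code used):

# hedged sketch: lock the boundaries of matching words so that later
# regrouping steps cannot split or merge at them; parameters are assumed
result.lock(startswith='I', left=True, right=True)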

Crypto90 commented 3 months ago

31
00:04:28,740 --> 00:04:29,180
I

32
00:04:35,100 --> 00:04:37,240
have it, I have it.

nonspeech_sections printed for this time frame:

{'start': 253.0, 'end': 267.94}
{'start': 268.26, 'end': 268.34}
{'start': 268.52, 'end': 276.74}

so the "I" is located in the nonspeech_section for {'start': 268.52, 'end': 276.74},

jianfch commented 2 months ago

I tested all kinds of parameters, and the only parameter that creates this issue is vad_filter=True, no matter what suppress_silence, suppress_word_ts, vad, and use_word_position are set to.

Then this seems to be an issue caused by Faster-Whisper's implementation of vad_filter. When suppress_silence=False, the timestamps remain mostly unaltered as returned by faster_whisper.WhisperModel.transcribe().

so the "I" is located in the nonspeech_section for {'start': 268.52, 'end': 276.74},

The nonspeech_sections are only used for trimming the durations of words and segments after Faster-Whisper returns the timestamps (i.e. they won't extend the end of the "I" to the start of "have"). So what likely occurred is that Faster-Whisper returned "I" with timestamps that happened to overlap the nonspeech_sections computed by Stable-ts. You can confirm this by calling faster_whisper.WhisperModel.transcribe() directly (which is just model.transcribe() instead of model.transcribe_stable(), without the Stable-ts parameters) and then checking the timestamps:

import stable_whisper

# load the Faster-Whisper model through Stable-ts; model.transcribe() is
# Faster-Whisper's original method, model.transcribe_stable() is Stable-ts's
model = stable_whisper.load_faster_whisper('large-v3')

result_gen, info = model.transcribe('audio.wav', vad_filter=True, word_timestamps=True, task='translate')
segments = []
for segment in result_gen:
    segment = segment._asdict()
    # convert the Word namedtuples to dicts as well
    segment['words'] = [w._asdict() for w in segment['words']]
    segments.append(segment)
result = stable_whisper.WhisperResult(segments)

If this is the case, you can try changing the default settings of vad_parameters for Faster-Whisper (the defaults are at https://github.com/SYSTRAN/faster-whisper/blob/d57c5b40b06e59ec44240d93485a95799548af50/faster_whisper/vad.py#L25-L35). For example:

result = model.transcribe_stable(..., vad_parameters=dict(min_speech_duration_ms=1000))

If changing vad_parameters does not help, I'd suggest submitting an issue on Faster-Whisper's repo.