mirix closed this issue 11 months ago
I have seen something similar; in my case it was caused by the --initial_prompt switch. For some reason, the small and medium models differ in no_speech_probability (for this sample; I haven't tested it more extensively than that).
EDIT: Found another one with the same symptoms (missing sentence after misc sounds). For this one, disabling VAD helped.
Have you tried with --accurate? EDIT: The help text on --accurate says the following:
Shortcut to use the same default option as in Whisper (best_of=5, beam_search=5, temperature_increment_on_fallback=0.2)
Vanilla Whisper doesn't have the --vad option so that's gotta be disabled if you want to do any comparison.
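For anyone using the Python API rather than the command line, here is a rough sketch of what the --accurate settings quoted above might look like as keyword arguments, with VAD switched off for a fair comparison against vanilla Whisper. The helper names accurate_kwargs and transcribe_like_vanilla are made up for this example, and I'm assuming the Python keyword is beam_size (the help text says beam_search) and that the temperature fallback is passed as a tuple, so check this against your installed version.

```python
def accurate_kwargs(increment=0.2):
    """Build decoding options mirroring the --accurate help text.

    Hypothetical helper: best_of=5, beam_size=5 and a temperature
    fallback schedule running from 0.0 to 1.0 in steps of `increment`.
    """
    steps = int(round(1.0 / increment)) + 1
    temperature = tuple(round(k * increment, 2) for k in range(steps))
    return {"best_of": 5, "beam_size": 5, "temperature": temperature}


def transcribe_like_vanilla(audio_path, model_name="large-v2"):
    """Run whisper-timestamped with VAD off and the options above.

    Imported lazily so the helper above works without the package.
    """
    import whisper_timestamped as whisper

    audio = whisper.load_audio(audio_path)
    model = whisper.load_model(model_name)
    # vad=False because vanilla Whisper has no VAD stage to compare with.
    return whisper.transcribe(model, audio, vad=False, **accurate_kwargs())
```

Running the same file once with these options and once with your current ones should at least show whether the decoding settings or the VAD step make the sentence disappear.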
I use the Python API rather than the command line, but yes, I use large-v2.
On Sun, 10 Sept 2023, 14:55 misutoneko, @.***> wrote:
I have seen something similar, in my case it was caused by the --initial_prompt switch. For some reason, the small model seems to have a better confidence on speech_probability than medium (for this sample, haven't tested it more extensively than that).
Have you tried with --accurate?
Hello,
I am working on a diarisation project that is producing satisfactory results.
Right now it relies on stable_ts because, even if not perfect, it is the approach that has produced the most complete and accurate results so far.
WhisperX is also excellent, but its inability to timestamp non-dictionary tokens is a no-go.
I have been playing a bit with whisper-timestamped as well and, indeed, the word-level timestamps seem to be more accurate than those produced with other methods. I also find the detect_disfluencies function very useful.
However, entire sentences are missing from the transcription in certain audio files. These are sentences that are immediately preceded by incomprehensible utterances.
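Since the audio itself cannot be shared, one way to describe where the sentences go missing is to list the time ranges that no segment covers. The sketch below assumes the usual Whisper-style result, a dict with a "segments" list whose items carry "start" and "end" in seconds. The function name find_gaps and the 2-second threshold are arbitrary choices for illustration.

```python
def find_gaps(segments, min_gap=2.0):
    """Return (start, end) time ranges not covered by any segment.

    `segments` is assumed to be a list of dicts with "start" and "end"
    keys in seconds, as in result["segments"].
    """
    gaps = []
    prev_end = 0.0
    for seg in sorted(segments, key=lambda s: s["start"]):
        # A long silence in the transcript may hide a dropped sentence.
        if seg["start"] - prev_end >= min_gap:
            gaps.append((prev_end, seg["start"]))
        prev_end = max(prev_end, seg["end"])
    return gaps
```

Calling find_gaps(result["segments"]) on the whisper-timestamped output and comparing the ranges with where the incomprehensible utterances occur would show whether the whole region is skipped or only the following sentence.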
I have tried all sorts of options such as beam_size, temperature, best_of, switching VAD on and off, etc., but those sentences are always missing.
The interesting thing is that the same sentences are transcribed with vanilla Whisper, WhisperX, Faster Whisper, stable_ts and any other approach I have tested.
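The comparison against the other tools can be made a bit more systematic with a word-level diff of the two transcripts. This is only a minimal sketch using difflib from the standard library. The function names and the min_words threshold are invented for this example, and it only reports runs that are entirely absent from the candidate, not substituted words.

```python
import difflib
import re


def words(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[\w']+", text.lower())


def missing_spans(reference, candidate, min_words=3):
    """List runs of reference words that are absent from the candidate.

    `reference` would be the text from e.g. vanilla Whisper and
    `candidate` the text from whisper-timestamped.
    """
    ref, cand = words(reference), words(candidate)
    matcher = difflib.SequenceMatcher(a=ref, b=cand, autojunk=False)
    spans = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" means the words exist only in the reference.
        if tag == "delete" and i2 - i1 >= min_words:
            spans.append(" ".join(ref[i1:i2]))
    return spans
```

Passing result["text"] from both runs gives a list of dropped passages that can be reported without sharing the confidential audio.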
So my guess is that this has something to do with the Dynamic Time Warping.
Unfortunately, I cannot share any sample files as they are confidential.
Best,
Ed