linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

start + end outside length of audio #6

Open doublex opened 1 year ago

doublex commented 1 year ago

This 15s audio: gaenswein15.zip

Command line:

python3 whisper_timestamped/transcribe.py ~/gaenswein15.wav --model large-v2 --language de

Timestamps:

    {
      "id": 2,
      "seek": 1300,
      "start": 27.16,
      "end": 27.86,
      "text": " Das hat er als emeritus Ritus gewünscht.",
Jeronymous commented 1 year ago

Thank you @doublex for reporting and giving a simple way to reproduce (I really appreciate that). I'll investigate this after fixing issue #4.

Jeronymous commented 1 year ago

This seems to be mainly due to the use of "large-v2", which predicts some nonsense timestamps.

$ whisper gaenswein15.mp3 --model large-v2 --language de

[00:00.000 --> 00:09.000]  Die Wiederzulassung des Messbuchs von 1962 als Missale für die außerordentliche Form des römischen Ritus
[00:09.000 --> 00:13.000]  ist dann nicht so weitergegangen, wie sich Papst Benedikt das gewünscht hatte.
[00:13.000 --> 00:31.000]  Das hatte er als emeritus Ritual.

(The last "00:31.000" is outside the boundary)

I had other issues with the upgrade from large-v1 to large-v2, and I have heard other complaints about v2 giving worse results than v1. So for now, I advise using model large-v1 instead (if not medium).

Concerning the bug, I think that the main problem could/should be fixed in openai-whisper itself (unfortunately, there does not seem to be an "Issues" section on that repo...). The decoding process should prevent the model from predicting timestamps outside the signal boundaries: there is already a mechanism to exclude some predictions at each decoding step, and additional constraints could be added there to take into account the maximal duration of the actual signal within the padded input.
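To illustrate the kind of constraint meant here, a minimal sketch (this is not openai-whisper's actual code; only the timestamp-token convention, <|0.00|> advancing in 20 ms steps, comes from whisper's tokenizer):

import torch

TIME_PER_TOKEN = 0.02  # whisper timestamp tokens advance in 20 ms steps

def suppress_out_of_bounds_timestamps(
    logits: torch.Tensor,   # (batch, vocab) logits for the next token
    timestamp_begin: int,   # index of the first timestamp token, <|0.00|>
    audio_duration: float,  # true (unpadded) duration of the input, in seconds
) -> torch.Tensor:
    """Mask timestamp tokens that would point past the end of the audio."""
    last_valid = timestamp_begin + int(audio_duration / TIME_PER_TOKEN)
    if last_valid + 1 < logits.shape[-1]:
        logits[:, last_valid + 1:] = float("-inf")
    return logits

Applied at each decoding step (in the same spirit as whisper's existing SuppressTokens and timestamp rules), this would make it impossible to ever predict "00:31" on a 15-second file.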

I'm looking at what I can do in whisper-timestamped. I am not sure the general problem can be fixed without touching whisper (maybe I'll end up forking whisper in the end, 'cause I am already hacking and reverse-engineering it a lot...). But in the particular case of this issue, only the last end timestamp is wrong, so maybe I can do something... (if the start timestamp of the last segment were outside the boundary, it would be much worse...)
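In the meantime, a post-processing workaround for that particular case could look like the following (a hypothetical sketch, not necessarily the fix that will be implemented): clamp any trailing timestamps to the real audio duration.

def clamp_timestamps(result: dict, audio_duration: float) -> dict:
    """Clip segment and word timestamps that exceed the audio boundary."""
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            word["end"] = min(word["end"], audio_duration)
            word["start"] = min(word["start"], word["end"])
        segment["end"] = min(segment["end"], audio_duration)
        segment["start"] = min(segment["start"], segment["end"])
    return result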

Jeronymous commented 1 year ago

@doublex Would you allow me to add an mp3 version of the audio you sent to the repo (for testing purposes, in the "test/data" folder)?

Jeronymous commented 1 year ago

OK, this is much better now. I pushed an improvement that solves the issue on your audio. (I'm still interested to know whether I can include that audio in the test suite.)

doublex commented 1 year ago

Thanks a lot! Great code! The interview was public - but unfortunately the recording is not mine. I'm not a lawyer.

Tangyiming205069 commented 1 year ago

I'm facing the same issue. I cannot upload the audio zip since it is too big. Using large-v2 on 30 audio files, I get the out-of-bounds issue on some files, and on some of them even the start timestamp of the last segment is outside the boundary, as you mentioned. I then reprocessed all files with large-v1, but the same thing happens on some other files (files that looked fine with v2). I set vad=True and left recompute_all_timestamps at its default (False); see the sketch below.
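For reference, that configuration corresponds roughly to the following call (a sketch: vad is a documented option of whisper_timestamped.transcribe, while recompute_all_timestamps is shown under its CLI flag name and may be spelled differently in the Python API):

import whisper_timestamped as whisper

model = whisper.load_model("large-v1")
result = whisper.transcribe(
    model,
    "audio.wav",  # placeholder path
    vad=True,     # pre-filter non-speech with voice activity detection
    # recompute_all_timestamps left at its default (False)
)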

[Two screenshots attached]

The screenshots show the result from whisper and the true length of the audio.

Jeronymous commented 1 year ago

Thank you for reporting, @Tangyiming205069. It's a pity that this can occur.

Some questions that can help to investigate:

Tangyiming205069 commented 1 year ago
  1. No such warnings.

  2. I have 30 files. Without VAD, file 03 no longer has the issue, but it appears on other files (with VAD, or using large-v2, the out-of-bounds errors happen on different files). The first screenshot below is with VAD, the second without. "Simple" means a word whose end is out of bounds; "severe" means the start of the word is out of bounds. The second case does not fit in one screenshot. Overall, 6 files go out of bounds and the others are fine.

    [Two screenshots attached: with VAD, and without VAD]
  3. I do have a file produced by the VAD. It records who is talking at each frame. When converting seconds to frames, some words fall in a range where no one is speaking. This happens fairly often, so I developed a way to assign a word to a speaker when it lands in a non-speech range (see the sketch after this list).
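A rough sketch of that kind of fallback (hypothetical names; it assumes a list of per-frame speaker labels, with None during silence, at a fixed frame rate):

def assign_speaker(word_start: float, word_end: float,
                   frame_speakers: list, frame_rate: float = 100.0):
    """Pick a speaker for the word spanning [word_start, word_end) seconds.

    frame_speakers[i] is the speaker label at frame i, or None for silence.
    """
    start_f = int(word_start * frame_rate)
    end_f = max(start_f + 1, int(word_end * frame_rate))
    # Majority vote over the frames the word covers, ignoring silence.
    labels = [s for s in frame_speakers[start_f:end_f] if s is not None]
    if labels:
        return max(set(labels), key=labels.count)
    # The word falls entirely in a non-speech range: walk outwards
    # to the closest labeled frame.
    for offset in range(1, len(frame_speakers)):
        for idx in (start_f - offset, end_f - 1 + offset):
            if 0 <= idx < len(frame_speakers) and frame_speakers[idx] is not None:
                return frame_speakers[idx]
    return None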

ethanzrd commented 1 year ago

Hey!

Regarding this issue:

I don't think I can reproduce it consistently: sometimes it happens, sometimes it doesn't. I have an application that continuously transcribes 2-second-long batches, and the start + end times really overshoot, using the small model. Here's an example:

{
    "text": " I'm going to see what's up. No it's also...",
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 19.7,
            "end": 29.14,
            "text": " I'm going to see what's up. No it's also...",
            "tokens": [
                50363,
                314,
                1101,
                1016,
                284,
                766,
                644,
                338,
                510,
                13,
                1400,
                340,
                338,
                635,
                986,
            ],
            "temperature": 0.0,
            "avg_logprob": -0.2860553562641144,
            "compression_ratio": 0.8958333333333334,
            "no_speech_prob": 0.07040286064147949,
            "confidence": 0.78,
            "words": [
                {"text": "I'm", "start": 19.7, "end": 24.82, "confidence": 0.588},
                {"text": "going", "start": 24.82, "end": 24.84, "confidence": 0.766},
                {"text": "to", "start": 24.84, "end": 24.86, "confidence": 0.994},
                {"text": "see", "start": 24.86, "end": 24.88, "confidence": 0.953},
                {"text": "what's", "start": 24.88, "end": 27.68, "confidence": 0.971},
                {"text": "up.", "start": 27.68, "end": 27.7, "confidence": 0.997},
                {"text": "No", "start": 27.7, "end": 27.72, "confidence": 0.624},
                {"text": "it's", "start": 27.72, "end": 28.94, "confidence": 0.922},
                {"text": "also...", "start": 28.94, "end": 29.14, "confidence": 0.405},
            ],
        }
    ],
    "language": "en",
}

However, when transcribed separately (not following previous batches), it still overshoots, but not as hard:

{
    "text": " to see what's up. No, it's also a",
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 0.0,
            "end": 0.74,
            "text": " to see what's up.",
            "tokens": [50363, 284, 766, 644, 338, 510, 13, 50413],
            "temperature": 0.0,
            "avg_logprob": -0.9967834299260919,
            "compression_ratio": 0.9523809523809523,
            "no_speech_prob": 0.39323943853378296,
            "confidence": 0.302,
            "words": [
                {"text": "to", "start": 0.0, "end": 0.22, "confidence": 0.028},
                {"text": "see", "start": 0.22, "end": 0.32, "confidence": 0.171},
                {"text": "what's", "start": 0.32, "end": 0.52, "confidence": 0.738},
                {"text": "up.", "start": 0.52, "end": 0.74, "confidence": 0.954},
            ],
        },
        {
            "id": 1,
            "seek": 0,
            "start": 1.38,
            "end": 2.24,
            "text": " No, it's also a",
            "tokens": [50413, 1400, 11, 340, 338, 635, 257, 50463],
            "temperature": 0.0,
            "avg_logprob": -0.9967834299260919,
            "compression_ratio": 0.9523809523809523,
            "no_speech_prob": 0.39323943853378296,
            "confidence": 0.409,
            "words": [
                {"text": "No,", "start": 1.38, "end": 1.58, "confidence": 0.253},
                {"text": "it's", "start": 1.66, "end": 1.82, "confidence": 0.892},
                {"text": "also", "start": 1.82, "end": 2.04, "confidence": 0.292},
                {"text": "a", "start": 2.04, "end": 2.24, "confidence": 0.194},
            ],
        },
        {
            "id": 2,
            "seek": 0,
            "start": 2.0,
            "end": 3.5,
            "text": " It's a",
            "tokens": [632, 338, 257],
            "temperature": 0.0,
            "avg_logprob": -0.9967834299260919,
            "compression_ratio": 0.9523809523809523,
            "no_speech_prob": 0.39323943853378296,
            "words": [
                {"text": "It's", "start": 2.84, "end": 3.48, "confidence": 0.215},
                {"text": "a", "start": 3.48, "end": 3.5, "confidence": 0.464},
            ],
            "confidence": 0.278,
        },
    ],
    "language": "en",
}

Audio file used (transcribed using its waveform)

The transcription call:

import whisper_timestamped as whisper

model = whisper.load_model("small.en")
transcription = whisper.transcribe(
    model,  # transcribe() expects a loaded model, not a model name
    audio,
    initial_prompt=self._buffer,
    language="en",
    verbose=True,  # print segments while decoding (also disables the progress bar)
)

The buffer is an empty string in the second example since there were no batches prior to it.
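For context, the batching works roughly like this (a simplified sketch, not the actual application; audio_chunks and the 16 kHz float32 format are assumptions):

import numpy as np
import whisper_timestamped as whisper

model = whisper.load_model("small.en")
buffer = ""  # accumulated text of the previous batches

for chunk in audio_chunks:  # each chunk: ~2 s of 16 kHz float32 samples
    result = whisper.transcribe(
        model,
        np.asarray(chunk, dtype=np.float32),
        initial_prompt=buffer,  # condition decoding on what came before
        language="en",
    )
    buffer += result["text"]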

Happens pretty often with other audio files. Would love to get your opinion :)