linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0
1.85k stars · 149 forks

Inconsistent number of segments error #64

Closed: olevanss closed this issue 6 months ago

olevanss commented 1 year ago

Hi!

I recently ran a transcription and received this error:

File "/usr/local/lib/python3.9/site-packages/whisper_timestamped/transcribe.py", line 259, in transcribe_timestamped (transcription, words) = _transcribe_timestamped_efficient(model, audio, File "/usr/local/lib/python3.9/site-packages/whisper_timestamped/transcribe.py", line 851, in _transcribe_timestamped_efficient assert l1 == l2 or l1 == 0, f"Inconsistent number of segments: whisper_segments ({l1}) != timestamped_word_segments ({l2})" AssertionError: Inconsistent number of segments: whisper_segments (57) != timestamped_word_segments (56)

Do you know the reason behind it?

If you need more details, please let me know.

Jeronymous commented 1 year ago

This is a duplicate of #59

I fixed this issue recently, and the fix landed in master a few minutes ago. Can you please update and retry?

pip install --upgrade --no-deps --force-reinstall git+https://github.com/linto-ai/whisper-timestamped

Jeronymous commented 1 year ago

I'm closing this, assuming it is fixed. If it still fails for you, you can reopen and give the output of whisper_timestamped --versions (this gives whisper_timestamped.__version__ as well as whisper.__version__).
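The same check can be done from Python using the attributes mentioned above (a minimal sketch, assuming both packages are importable):

import whisper
import whisper_timestamped

# These are the two version attributes that `whisper_timestamped --versions` reports
print("whisper_timestamped:", whisper_timestamped.__version__)
print("whisper:", whisper.__version__)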

darnn commented 1 year ago

Still happening for me with both Whisper and Whisper Timestamped updated: 1.12.3 -- Whisper 20230314 in C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\whisper

Jeronymous commented 1 year ago

Thanks! Oh! Whisper released a new version yesterday... That may explain it.

If it's a blocker for you, you can try pip install whisper==20230308 (or even version 20230124, which is not bad) and tell us if that resolves it.

Jeronymous commented 1 year ago

However, I don't see anything in particular in the last release that would explain the failure... Is there any chance you can share the audio and the details of all the options you use, for us to reproduce? (At least all the options.)

darnn commented 1 year ago

Sure. Audio: https://drive.google.com/file/d/1Gws313lBSie3HswzkhiOKMOf0HS6yMH8

Command and error:

C:\downloaded>whisper_timestamped efrat.wav --model tiny --output_dir c:\victor
Detected language: Hebrew
100%|██████████████████████████████████████████████| 109316/109316 [03:30<00:00, 519.28frames/s]
WARNING:whisper_timestamped:Inconsistent number of segments: whisper_segments (621) != timestamped_word_segments (620)
Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\Scripts\whisper_timestamped.exe\__main__.py", line 7, in <module>
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\whisper_timestamped\transcribe.py", line 2127, in cli
    result = transcribe_timestamped(
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\whisper_timestamped\transcribe.py", line 259, in transcribe_timestamped
    (transcription, words) = _transcribe_timestamped_efficient(model, audio,
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\whisper_timestamped\transcribe.py", line 851, in _transcribe_timestamped_efficient
    assert l1 == l2 or l1 == 0, f"Inconsistent number of segments: whisper_segments ({l1}) != timestamped_word_segments ({l2})"
AssertionError: Inconsistent number of segments: whisper_segments (621) != timestamped_word_segments (620)

Jeronymous commented 1 year ago

Thanks a lot @darnn, now I can reproduce :)

I will work on this soon

Jeronymous commented 1 year ago

This should be finally fixed in version 1.12.5

(Sorry about the inconvenience in previous versions; it took me some time to find a good solution to some corner cases, but now I think I got it right.)

Thanks again for reporting this issue so well, @darnn

Jeronymous commented 1 year ago

Still a work in progress, actually. I encountered another corner case that fails.

darnn commented 1 year ago

ty!

jeremymatt commented 1 year ago

Still a work in progress, actually. I encountered another corner case that fails.

Is this error still considered a work in progress? If it is, thanks for your work, and please disregard the info below (unless it's useful to you).

If not, I'm still encountering it using the medium model (I'm currently trying the other model sizes to see if they fail):

  File ~\Anaconda3\envs\stt\lib\site-packages\whisper_timestamped\transcribe.py:860 in _transcribe_timestamped_efficient
    assert l1 == l2 or l1 == 0, f"Inconsistent number of segments: whisper_segments ({l1}) != timestamped_word_segments ({l2})"

AssertionError: Inconsistent number of segments: whisper_segments (1118) != timestamped_word_segments (1117)

Another bit of information: the raw version of this audio stream does not crash the transcription script. However, the file is noisy and the transcription quality isn't great (lots of repeated text), so I ran the logmmse version of the Kalman filter on it. This substantially improved the audio quality, but transcribing now fails. The logmmse settings for this particular run are below (I'm also trying a few different noise thresholds to see which works best for my dataset; I'm not sure whether other noise thresholds cause failure or not):

import librosa
import logmmse
import soundfile

sr = 16_000
raw_audio, sr = librosa.load(audio_path, sr=sr)
filtered_audio = logmmse.logmmse(raw_audio, sr, initial_noise=6, window_size=0, noise_threshold=0.01)
# Save the filtered file for subsequent use (e.g., loading into Whisper for transcription - I use librosa for that as well)
soundfile.write(filtered_audio_output_path, filtered_audio, sr)

Versions:

# Name                    Version                   Build  Channel
openai-whisper            20230314                 pypi_0    pypi
whisper                   1.1.10                   pypi_0    pypi
whisper-timestamped       1.12.8                   pypi_0    pypi

Jeronymous commented 1 year ago

Oh dear, I was not aware this could fail again.

This kind of error really depends on what is transcribed by the inner Whisper model, with a "butterfly effect" that makes the issue hard to reproduce. Is there any chance you can share the "filtered file", along with all the options you give to whisper_timestamped.transcribe?

stungkuling commented 1 year ago

Hello, this did the trick for me.

Just adding the options beam_size=5, best_of=5 to the transcribe method of the module:

results = whisper_timestamped.transcribe(model, audio, verbose=True, beam_size=5, best_of=5)

I hope this helps.

jeremymatt commented 1 year ago

Oh dear, I was not aware this could fail again.

This kind of error really depends on what is transcribed by the inner Whisper model, with a "butterfly effect" that makes the issue hard to reproduce. Is there any chance you can share the "filtered file", along with all the options you give to whisper_timestamped.transcribe?

Sorry for the delay, I've been busy with other stuff. Unfortunately I can't share the file (it's a HIPAA-protected recording of a healthcare conversation).

I've updated to version 1.12.8 and am still encountering this error, although with a different file now; the other one started working when I switched condition_on_previous_text from True to False (this also helped with hallucination problems).

The call to whisper is as follows:

import whisper_timestamped as whisper

options = {"task": "transcribe",
           "language": "English",
           "fp16": fp16,
           "no_speech_threshold": 0.1,
           "condition_on_previous_text": False,
           "logprob_threshold": -1.00}
result = whisper.transcribe(model, audio=audio, verbose=False, **options)

Jeronymous commented 1 year ago

Thank you @jeremymatt for your feedback. Unfortunately, I don't have enough elements to reproduce. But I modified something in the latest version (1.12.10) that might resolve this bug. Can you please retry where it was failing?

If it still fails, could you please use the --debug option and send me the stderr? (It can be by email: my email is in the commit logs of this repo.) The --debug option is for the CLI, but if you are in Python you can activate the debug logs using:

import logging
logging.basicConfig()
logger = logging.getLogger("whisper_timestamped")
logger.setLevel(logging.DEBUG)
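
For the CLI case mentioned above, the debug logs on stderr can be captured to a file with a standard shell redirection (the audio file name and model size below are placeholders):

whisper_timestamped audio.wav --model tiny --debug 2> debug.log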

Finally, if it's really a blocker for you, a workaround is to disable the efficient decoding, as spotted by @stungkuling. This can be done in Python by using one of these options with whisper-timestamped's transcribe() function: naive_approach=True, or decoding options that imply it, such as beam_size=5 and best_of=5.

The only downside is that decoding time will be higher. But transcription results can also be better (especially with beam_size=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0), best_of=5, which is the default in OpenAI's whisper lib). Independently of that workaround, I'm interested in solving this bug :) meaning interested in reproducing it (so any help to do so is welcome).
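
For reference, a minimal sketch of this workaround, using only options named in this thread (the model size and audio path are placeholder assumptions):

import whisper_timestamped as whisper

model = whisper.load_model("tiny")       # placeholder model size
audio = whisper.load_audio("audio.wav")  # placeholder audio path

# Option 1: explicitly disable the efficient decoding
result = whisper.transcribe(model, audio, naive_approach=True)

# Option 2: beam search with several candidates and temperature fallback
# (OpenAI whisper's defaults), which also avoids the efficient decoding
result = whisper.transcribe(
    model, audio,
    beam_size=5,
    best_of=5,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
)

Option 2 trades decoding speed for potentially better transcription quality, as noted in the comment above.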

Jeronymous commented 1 year ago

I finally identified something that could cause this error.

I'm crossing my fingers very hard that this bug is finally solved in the new version 1.12.11

jeremymatt commented 1 year ago

Thanks for your hard work on this! It's a super useful tool; it's helping me out a ton, and I'll be using it for at least one paper.

I'll re-try the problematic file in a bit and will let you know how it goes.

Another solution (sort of) is to transcribe in parts and then just join the transcripts. This is similar to how I'm dealing with the hallucinations. Hallucinations are easy to detect, as they consist of repeated phrases; at least for my transcripts, if there wasn't phrase repetition, the quality of the transcription was acceptable. There's some funkiness, such as when a word shows up twice in a phrase. For example, "I think that I should I think that I should I think that I should" is a period-5 repetition, but "I" has a 3/2/3 pattern. Anyway, I just find the repetition, clip that out of the transcript, and then re-transcribe only that section of audio.
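
A rough sketch of the repetition detection described above (the helper and its thresholds are hypothetical illustrations assuming whitespace tokenization, not @jeremymatt's actual code):

def find_repetition(words, max_period=10, min_repeats=3):
    # Return (start, period) of the first run where a phrase of `period` words
    # repeats at least `min_repeats` times back-to-back, else None.
    for start in range(len(words)):
        for period in range(1, max_period + 1):
            repeats = 1
            while (start + (repeats + 1) * period <= len(words)
                   and words[start + repeats * period:start + (repeats + 1) * period]
                       == words[start:start + period]):
                repeats += 1
            if repeats >= min_repeats:
                return start, period
    return None

text = "I think that I should I think that I should I think that I should"
print(find_repetition(text.split()))  # -> (0, 5): a period-5 repetition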

eloukas commented 11 months ago

Any updates on this?

Jeronymous commented 11 months ago

We had no feedback on whether it was fixed for @jeremymatt, and as nobody reported this error anymore, we assumed it was fixed after April 3rd (version 1.12.11 and higher).

Do you have such an exception, @eloukas? If yes, can you give more details and maybe a way to reproduce? If there is something, we can re-open this issue or open another.

iampickle commented 7 months ago

Versions:

- python: whisper-timestamped==1.14.4, torch==1.13.0
- system: nvidia-cuda-toolkit==11.8

Got the same error:

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3901509/3901509 [01:46<00:00, 36483.15frames/s]
Inconsistent number of segments: whisper_segments (1388) != timestamped_word_segments (1109)
Traceback (most recent call last):
  File "/home/tbot/twitchbot/test.py", line 7, in <module>
  File "/home/tbot/miniconda3/envs/tbot/lib/python3.11/site-packages/whisper_timestamped/transcribe.py", line 285, in transcribe_timestamped
    (transcription, words) = _transcribe_timestamped_efficient(model, audio,
  File "/home/tbot/miniconda3/envs/tbot/lib/python3.11/site-packages/whisper_timestamped/transcribe.py", line 903, in _transcribe_timestamped_efficient
    assert l1 == l2 or l1 == 0, f"Inconsistent number of segments: whisper_segments ({l1}) != timestamped_word_segments ({l2})"

This is my code:

import json

import whisper_timestamped as whisper

audio = whisper.load_audio("/media/raid/twitch/papaplatte/papaplatte-stream-2024-01-30/temp_1.5_15.22.mp4")
model = whisper.load_model("tiny", device="cuda")
result = whisper.transcribe(model, audio, language='de')

print(json.dumps(result, indent=2, ensure_ascii=False))

When the AssertionError was commented out, the code was able to print the results as JSON, but I'm not sure whether they're reliable.

blob of the data: https://pastes.io/embed/bsmewxtuyd

Jeronymous commented 7 months ago

Thanks @iampickle

I'm reopening this issue, which is also being discussed here: https://github.com/linto-ai/whisper-timestamped/discussions/79

Having your openai-whisper version would also help us understand. And is it possible to have the audio, to be able to reproduce? (Otherwise, see the last comment in the discussion linked above: there are ways to get more debug outputs.)

And I think this bug is problematic for the result (which is probably wrong). A possible workaround is to use naive_approach=True, as commented here: https://github.com/linto-ai/whisper-timestamped/issues/64#issuecomment-1494398225 (things will just be slower with this workaround).

iampickle commented 7 months ago

Sure. Download for the mp4 is 22 GB!? At first I used the newest version, and then tried the version (whisper==20230308) mentioned above. Both gave the same result. Debug output: clio.txt

lumpidu commented 6 months ago

So I tested this module to see if I could get anywhere with my finetuned whisper-v2 model. Unfortunately, the timestamps are often bad, especially if I am using beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0), with or without VAD.

  1. The segments don't seem to overlap enough, i.e. the words at those boundaries are not correct. I don't know whether you can work around this with VAD or not, but IMHO one should either overlap the segments by at least 5 seconds and synchronize the generated text, OR use VAD to find a speech pause before the 30-second boundary, so that Whisper doesn't need to transcribe an utterance that has been cut off in the middle.
  2. One should add a consistency check for the generated segments. Often, problematic segments don't add up with the previously generated segments, i.e. the end time of the previous segment and the start time of the next segment don't align.
  3. Often I can see that segments with problems also have a very low confidence level for most words, although the words themselves are transcribed correctly.

As you often ask about concrete audio files: these are audios generated via Microsoft TTS with an Icelandic voice. The text itself is not specific. My guess is that you can use the same approach (use a TTS system) to generate enough test data yourself.

The problem does not lie in the TTS audio files: they are very clear, with consistent timing and pauses, no background noise at all, etc.

Jeronymous commented 6 months ago

@lumpidu If I understand correctly, your problem is not related to the current issue (which is a failure that can happen in some corner cases and that I could not reproduce yet), but to the quality of the timestamps with a finetuned model? (Maybe due to alignment heads that have to be re-estimated for this model.)

Concerning 1: Do you mean you need overlapping segments/words?
Concerning 2: There are already many consistency checks in the code. Are you suggesting here that segments should be contiguous (start where the previous one ended) when VAD does not detect silence?

Anyway, this description is not clear enough for me to understand the suggestion. 1) If you don't see an assertion failure ("Inconsistent number of segments"), please open a separate issue. 2) A concrete example would help to clarify (e.g.: here is the audio, here is the output of whisper-timestamped, and I'm not satisfied with what happens with that segment... and this one... and this one...). If you're using examples from a TTS system, I guess there is no problem sharing them. We already have a lot of test data (coming from real use cases), and we are not going to run Microsoft TTS (particularly because it's not free and not open-source).

lumpidu commented 6 months ago

Yes, maybe it's a different bug, but maybe it's also related; you need to decide. I see, for example, the following problems when looking at the segments:

"segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 30.0,
      "text": "afbrot og refsjábyrgð eitt efnisyfirlit ...",
      "tokens": [...],
      "temperature": 0.0,
      "avg_logprob": -0.024239512714179786,
      "compression_ratio": 1.8644067796610169,
      "no_speech_prob": 8.609986252849922e-05,
      "confidence": 0.988,
      "words": [ ... ]
     ....
    },
    {
      "id": 1,
      "seek": 3000,
      "start": 30.0,
      "end": 31.58,
      "text": "ilög á grundvelli þjóðréttarsamninga tuttugu og tvö þrjú íslensk refsilög og áhrif mannréttindareglna...",
      "tokens": [ ... ],
      "confidence": 0.031,
     ...
    },
    {
      "id": 2,
      "seek": 6000,
      "start": 59.74,
      "end": 60.8,
      "text": "fsiréttar í fræðikerfi lögfræðinnar tuttugu og sjö fjögur grundvallarhugtökin afbrot og refsing tuttugu og sjö...",
      "tokens": [ ... ],
      "confidence": 0.011,
     ...
    },
...
]

Take a look at the start, end segment data:

There is no warning on stderr/stdout about non-aligning segments or low confidence values of the transcripts. There is also no way any ASR system can generate correct first or last words if segments start or end in the middle of a spoken word. Therefore, my suggestion is to use a less naive approach, either via VAD or via overlapping segments. It's not clear to me which of these approaches has already been implemented by whisper_timestamped.
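
A minimal sketch of the kind of consistency check suggested here, run over the "segments" structure shown above (the gap and confidence thresholds are arbitrary assumptions):

def check_segments(result, max_gap=2.0, min_confidence=0.2):
    segments = result["segments"]
    # Flag suspicious gaps between consecutive segments
    for prev, cur in zip(segments, segments[1:]):
        gap = cur["start"] - prev["end"]
        if gap > max_gap:
            print(f"Warning: {gap:.2f}s gap between segments {prev['id']} and {cur['id']}")
    # Flag segments whose overall confidence is very low
    for seg in segments:
        if seg["confidence"] < min_confidence:
            print(f"Warning: segment {seg['id']} has low confidence ({seg['confidence']:.3f})")

On the example output above, this would flag the roughly 28-second gap between segments 1 and 2, as well as the low confidence of both of those segments.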

Jeronymous commented 6 months ago

OK @lumpidu, so it's another issue, about the quality of the timestamps. It is normal for consecutive segments to be non-contiguous: there can be silence in between. And the low quality of the alignment may be due to the fact that you are using a finetuned model without having adapted the alignment heads. If you want this to be investigated, please open a new issue, providing the audio and the exact thing that you run, for reproduction.

Jeronymous commented 6 months ago

@iampickle The failure should not happen anymore (in new version 1.15.0 of whisper-timestamped).

Thank you for having given everything needed to reproduce and investigate this properly. And sorry it took me some time to investigate; handling the 10-hour audio was tricky.

Note that the transcription results are rather poor on your audio with music (it transcribes only "Musik"). This is partly because you are using a "tiny" model (and moreover, you are transcribing with default greedy decoding). And of course, transcribing music is challenging for the model. But at least it allowed us to spot a possible corner case of failure.