linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Most of the result is lost when I use whisper-timestamped #67

Closed: YeDaxia closed this issue 1 year ago

YeDaxia commented 1 year ago

Environment:

openai-whisper           20230314
whisper-timestamped      1.12.7

Relevant part of my code:

import whisper_timestamped as whisper

audio = whisper.load_audio(audioFile)                     # audioFile: path to the input audio
model = whisper.load_model('large-v2')
result = whisper.transcribe(model, audio, language=lang)  # lang: language code of the audio

Here I use result["segments"] to generate the subtitles, roughly as sketched below:
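
(Simplified sketch of that step, assuming each segment carries "start", "end" and "text" fields, as in whisper-timestamped's output.)

def format_timestamp(seconds):
    # convert seconds to the SRT time format HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("subtitles.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n")
        f.write(f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n")
        f.write(seg["text"].strip() + "\n\n")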

Result with whisper-timestamped (much of the text is missing):

[screenshots: generated subtitles with much of the transcript missing]

Result without whisper-timestamped (plain openai-whisper):

[screenshot: the full transcript is present in the subtitles]

Am I using the wrong version?

Jeronymous commented 1 year ago

Default options are not the same in whisper-timestamped and in whisper. That could be it.

Can you try adding the last three options in the call below?

result = whisper.transcribe(model, audio, language=lang,
   beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
)
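
If it still behaves differently, a rough side-by-side check like the sketch below (reusing audioFile and lang from your snippet; adjust to however you run plain whisper) would show whether the decoding options are the cause:

import whisper
import whisper_timestamped

model = whisper.load_model('large-v2')
audio = whisper.load_audio(audioFile)

# plain openai-whisper with its own defaults
baseline = model.transcribe(audio, language=lang)

# whisper-timestamped with the decoding options suggested above
aligned = whisper_timestamped.transcribe(
    model, audio, language=lang,
    beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
)

# a large difference here means text is still being lost somewhere
print(len(baseline["text"]), len(aligned["text"]))
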
YeDaxia commented 1 year ago

I tried adding the options above, but the output did not change.

mmichelli commented 1 year ago

I get the same type of problem.

The error: Too much text (138 tokens) for the given number of frames (106) in: <|0.00|>Framdrift i forhold til denne dataen er at vi har gjennomgang i juni. Vi har avtalt med ungdomsrådet at vi skal ha en workshop hvor ungdommen inviterer ulike bidragsytere til å ha en prosess knyttet til bildet vi ser foran oss. Hva ser vi for oss som viktige tiltak? Vi gir en tilbakemelding i det gjennom ungdomsrådet. Så må denne ungdata ses i sammenheng med andre tilbakemeldinger vi får.<|1.62|>

The output: Framdrift i forhold til denne

Jeronymous commented 1 year ago

@mmichelli I don't know if it is the same problem as @YeDaxia's, but your description is clear. (@YeDaxia, have you seen any similar warnings when running your script?)

What you observe happens when there are not enough 20 ms frames to assign a timestamp to every recognized token (so something is probably wrong in Whisper's output). However, your output is surprisingly short! I don't see how that is possible and suspect a bug... Your output should be at least around 50 words (the warning mentions 138 sub-word tokens, including punctuation and the start/end timestamp tokens).
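
As a rough sanity check of the numbers in that warning (assuming every token must get at least one 20 ms frame, which is Whisper's timestamp resolution):

tokens = 138            # text tokens reported in the warning above
frames = 106            # frames available for the segment
frame_duration = 0.02   # seconds per frame

print(frames * frame_duration)   # ~2.12 s of audio to distribute over the tokens
print(tokens > frames)           # True: every token cannot get its own frame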

Also, I think I can improve things when what is wrong in Whisper's prediction is the end timestamp of the segment. Let me work on that.

Jeronymous commented 1 year ago

I pushed a new version that should fix the problem observed by @mmichelli.

The issue reported by @YeDaxia is still not clear to me. I would need more information to reproduce and understand it: an audio file that triggers the problem, along with all the options used (see the sketch below for what would help).
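
For instance, a hypothetical reporting script along these lines (the file name and language code are placeholders) would capture everything needed:

import json
import whisper_timestamped as whisper

audio = whisper.load_audio("problem_audio.wav")   # placeholder: an audio file that shows the issue
model = whisper.load_model('large-v2')
result = whisper.transcribe(
    model, audio, language="zh",                  # placeholder: the language code actually used
    beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
)

# dump the full output so the segments, words and timestamps can be inspected
with open("result.json", "w", encoding="utf-8") as f:
    json.dump(result, f, ensure_ascii=False, indent=2)

print(result["text"])   # quick check of how much text actually comes out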