Closed. YeDaxia closed this issue 1 year ago.
Default options are not the same in whisper-timestamped and in whisper. That could be it.
Can you try again with the last three options below (beam_size, best_of, temperature)?
result = whisper.transcribe(
    model, audio, language=lang,
    beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
)
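To make it easier to compare runs with and without these options, the call can be wrapped in a small helper. This is only a sketch: `WHISPER_LIKE_OPTIONS` and `transcribe_like_whisper` are hypothetical names, not part of whisper or whisper-timestamped, and the option values are simply the ones quoted above.

```python
# Decoding options suggested in this thread to mimic openai-whisper's behaviour.
WHISPER_LIKE_OPTIONS = {
    "beam_size": 5,
    "best_of": 5,
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
}


def transcribe_like_whisper(transcribe_fn, model, audio, language=None, **overrides):
    """Call `transcribe_fn` with the options above (hypothetical helper).

    Keyword arguments passed in `overrides` replace the suggested values,
    which makes it easy to toggle one option at a time while debugging.
    """
    options = {**WHISPER_LIKE_OPTIONS, **overrides}
    return transcribe_fn(model, audio, language=language, **options)


# Intended usage (assumes whisper-timestamped is installed; not executed here):
#   import whisper_timestamped as whisper
#   model = whisper.load_model("tiny")
#   audio = whisper.load_audio("audio.wav")  # placeholder path
#   result = transcribe_like_whisper(whisper.transcribe, model, audio, language="en")
```

Passing the transcribe function as an argument keeps the helper usable with either package, so the same options can be tried against both.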
I tried adding the options above, but nothing seems to change.
I get the same type of problem.
The error:
Too much text (138 tokens) for the given number of frames (106) in: <|0.00|>Framdrift i forhold til denne dataen er at vi har gjennomgang i juni. Vi har avtalt med ungdomsrådet at vi skal ha en workshop hvor ungdommen inviterer ulike bidragsytere til å ha en prosess knyttet til bildet vi ser foran oss. Hva ser vi for oss som viktige tiltak? Vi gir en tilbakemelding i det gjennom ungdomsrådet. Så må denne ungdata ses i sammenheng med andre tilbakemeldinger vi får.<|1.62|>
The output:
Framdrift i forhold til denne
@mmichelli I don't know if it is the same problem as @YeDaxia but your description is clear. (@YeDaxia, have you seen any similar warnings when running your script?)
What you observe happens when there is not enough time to assign a 20 ms timestamp to each of the recognized tokens (so something is probably wrong in Whisper's outputs). However, your output is surprisingly short! I don't see how that's possible, and I suspect a bug... Your output should be about 50 words at least (106 sub-word tokens, including punctuation and start/end timestamp tokens).
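As a rough illustration of that constraint, the check below estimates how many tokens fit between two timestamps at 20 ms per token. It is a sketch under that single assumption: the function names are hypothetical, and the frame count that whisper-timestamped reports in its warning (106 in the message above) is computed by the library itself and need not equal this estimate.

```python
SECONDS_PER_TOKEN = 0.02  # 20 ms per token, as described in the comment above


def max_tokens_for_span(start, end, seconds_per_token=SECONDS_PER_TOKEN):
    """Return how many tokens fit in [start, end] at one token per time slot."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    # round() guards against floating-point noise such as 80.99999999999999.
    return int(round((end - start) / seconds_per_token))


def has_too_much_text(num_tokens, start, end):
    """True when the segment holds more tokens than time slots (hypothetical check)."""
    return num_tokens > max_tokens_for_span(start, end)


# The segment quoted above spans <|0.00|> to <|1.62|> and holds 138 tokens.
print(max_tokens_for_span(0.0, 1.62))
print(has_too_much_text(138, 0.0, 1.62))
```

A segment flagged this way usually points at an implausible end timestamp from the model rather than at the text itself.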
Also, I think I can improve things if what is wrong in Whisper's prediction is the end timestamp. Let me work on that.
I pushed a new version that should fix the problem observed by @mmichelli.
@YeDaxia's issue is not clear to me. I would need more information to reproduce and understand it (an audio file along with all the options used).
Environment:
Relevant part of the code:
Here I use result["segments"] to generate the subtitles:
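For reference, a minimal sketch of turning `result["segments"]` into SRT subtitles might look like the following. It is not the code used in this report. It assumes each segment is a dict with `start` and `end` in seconds and a `text` string, and the helper names `format_srt_time` and `segments_to_srt` are hypothetical.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from segments with 'start', 'end' and 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        text = seg["text"].strip()
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


# Example with made-up segments (not real transcription output).
example = [
    {"start": 0.0, "end": 1.62, "text": " First line"},
    {"start": 1.62, "end": 3.0, "text": " Second line"},
]
print(segments_to_srt(example))
```

Because the subtitle text comes straight from each segment's `text`, any words missing from the segments will also be missing from the subtitles, so it helps to inspect the segments before suspecting the subtitle writer.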
Result with missing text:
Without whisper-timestamped:
Am I using the wrong version?