linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Key words in segment is missing #126

mperetto closed this issue 11 months ago

mperetto commented 11 months ago

Hello, and thanks for this beautiful tool. Recently I noticed that when I transcribe an audio file I can no longer get the word timestamps: the "words" key is missing from each segment.

I installed the tool with the following commands:

!pip3 install git+https://github.com/linto-ai/whisper-timestamped
!apt update && apt install ffmpeg

This is the code:

import whisper_timestamped as whisper
audio = whisper.load_audio('audio.wav')
model = whisper.load_model('medium')
result = model.transcribe(audio)

And this is the result I get:

{
"text": ".....",
"segments": [
{
   "id": 0,
   "seek": 0,
   "start": 2.0,
   "end": 4.0,
   "text": "....",
   "tokens": [0],
   "temperature": 0.0,
   "avg_logprob": -0.4302112872783954,
   "compression_ratio": 2.26,
   "no_speech_prob": 0.4641488790512085
}
]
}

All the code was executed on Google Colab. I tried restarting the runtime, but that did not help.

Am I doing something wrong?

Thank you in advance for your support.

Jeronymous commented 11 months ago

The thing is that there are no words in your transcription. There could/should be a "words" field, but it would be empty. The real question is why Whisper recognizes no text in your audio. Is it only background music?
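To tell the two cases apart (a "words" field that is present but empty versus one that is absent altogether), a small check over the result dictionary can help. This is only a sketch: the helper name check_words_field is made up for illustration, and it assumes each segment carries an "id" key as in the output above.

```python
def check_words_field(result):
    """Group segment ids by the state of their "words" field.

    Hypothetical helper, not part of whisper-timestamped.
    """
    report = {"missing": [], "empty": [], "ok": []}
    for segment in result.get("segments", []):
        if "words" not in segment:
            # The key is absent: the result did not go through word alignment.
            report["missing"].append(segment["id"])
        elif not segment["words"]:
            # The key exists but no words were recognized in this segment.
            report["empty"].append(segment["id"])
        else:
            report["ok"].append(segment["id"])
    return report
```

If every segment ends up under "missing", the transcription itself is probably fine and the word-level step simply never ran.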

mperetto commented 11 months ago

Sorry, I had removed the text from the result. Here is part of the segments with the recognized text:

{'id': 0,
   'seek': 0,
   'start': 0.0,
   'end': 2.0,
   'text': ' Un applauso per il professore!',
   'tokens': [50364, 1156, 724, 22590, 78, 680, 1930, 2668, 418, 0, 50464],
   'temperature': 0.0,
   'avg_logprob': -0.4302112872783954,
   'compression_ratio': 2.26,
   'no_speech_prob': 0.4641488790512085},
  {'id': 1,
   'seek': 0,
   'start': 2.0,
   'end': 4.0,
   'text': ' Dove va, professore?',
   'tokens': [50464, 1144, 303, 2773, 11, 2668, 418, 30, 50564],
   'temperature': 0.0,
   'avg_logprob': -0.4302112872783954,
   'compression_ratio': 2.26,
   'no_speech_prob': 0.4641488790512085
}

The text is there, but the words are not.

I tried the same audio some time ago and it worked, so perhaps some changes have been made to the model.

I'll try the recommended parameters. Thank you for the advice.

Jeronymous commented 11 months ago

Ah ok, sorry for the misunderstanding.

Then it seems to be a problem in your code. Instead of

result = model.transcribe(audio, ...)

It's meant to be

result = whisper.transcribe(model, audio, ...)
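
Put together with the code from the first post, a minimal sketch of the corrected call and of reading the word timestamps could look like this. The helper names iter_words and transcribe_words are made up for illustration, and the "text", "start" and "end" keys of each word entry are assumed from the output format shown in the project README.

```python
def iter_words(result):
    """Yield (text, start, end) for every word in a transcription result."""
    for segment in result.get("segments", []):
        # "words" is only filled in by whisper_timestamped.transcribe(model, audio);
        # segments returned by model.transcribe(audio) do not have it.
        for word in segment.get("words", []):
            yield word["text"], word["start"], word["end"]


def transcribe_words(path, model_name="medium"):
    """Transcribe an audio file and return its word-level timestamps."""
    # Imported here so that iter_words can be used without the package installed.
    import whisper_timestamped as whisper

    audio = whisper.load_audio(path)
    model = whisper.load_model(model_name)
    # Module-level function, with the model passed as the first argument.
    result = whisper.transcribe(model, audio)
    return list(iter_words(result))
```

Calling transcribe_words('audio.wav') would then return a list of (text, start, end) tuples, one per recognized word.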

mperetto commented 11 months ago

Oh yes, you are completely right. This fixed the problem.

Sorry for wasting your time with this stupid error.

Thanks again for your help.