linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Whisper_timestamped does not transcribe the whole video? #33

Closed aliscie closed 1 year ago

aliscie commented 1 year ago

whisper_timestamped transcribes only the first 10 words and ignores the rest?

Jeronymous commented 1 year ago

What do you mean by "sample_video"?

Do you have a command line or python code to show the problem you have?

Jeronymous commented 1 year ago

The description in issue #32 that you opened is a bit clearer.

Maybe you have to lower the --no_speech_threshold (try 0.0 instead of the default 0.6...)?

Can you try whisper alone to see if you have the same problem? (launch whisper instead of whisper_timestamped)

If whisper is fine (if it solves your problem), then you can try whisper_timestamped with option --accurate.

aliscie commented 1 year ago

> What do you mean by "sample_video"?
>
> Do you have a command line or python code to show the problem you have?

Oh man, sorry, I meant the whisper_timestamped package. "sample_video" is my project name 😂😂

my code is

import json

import whisper_timestamped as whisper

# Load the audio and the model, then transcribe with word-level timestamps
audio = whisper.load_audio("my_audio_file.wav")
model = whisper.load_model("base", device="cpu")
result = whisper.transcribe(model, audio, language="en")

print(json.dumps(result, indent=2, ensure_ascii=False))

Jeronymous commented 1 year ago

OK, your code is fine, and I cannot investigate much without having the audio to reproduce.

Is there something particular about the portion of speech that is not transcribed? (low volume...)

Have you tried with import whisper instead of import whisper_timestamped as whisper?

Have you tried whisper_timestamped with option no_speech_threshold = 0?

You can also play with options compression_ratio_threshold and logprob_threshold (lowering those thresholds).

And if the previous did not work, also try temperature = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0), best_of = 5.
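
To pin down which portion of the speech is missing, one option is to compare the end time of the last transcribed segment with the length of the audio. The following is only a minimal sketch: `coverage_report` is a hypothetical helper (not part of whisper or whisper_timestamped), and it assumes that the result dictionary has a "segments" list whose entries carry "start" and "end" times in seconds, and that `load_audio` returns samples at 16 kHz.

```python
SAMPLE_RATE = 16000  # assumption: load_audio resamples the audio to 16 kHz


def coverage_report(result, num_samples, sample_rate=SAMPLE_RATE):
    """Summarize how far into the audio the transcription reaches.

    Hypothetical helper: `result` is the dictionary returned by transcribe,
    `num_samples` is len(audio) for the array returned by load_audio.
    """
    duration = num_samples / sample_rate
    segments = result.get("segments", [])
    # End time of the last transcribed segment (0.0 if nothing was transcribed)
    last_end = max((seg["end"] for seg in segments), default=0.0)
    return {
        "audio_seconds": round(duration, 2),
        "last_segment_end": round(last_end, 2),
        "untranscribed_tail": round(max(duration - last_end, 0.0), 2),
    }
```

With the variables from the code above, this would be called as `coverage_report(result, len(audio))`. A large "untranscribed_tail" suggests that the decoder stopped early or that the remaining speech was rejected by one of the thresholds mentioned above.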

aliscie commented 1 year ago

> OK, your code is fine, and I cannot investigate much without having the audio to reproduce.
>
> Is there something particular about the portion of speech that is not transcribed? (low volume...)
>
> Have you tried with import whisper instead of import whisper_timestamped as whisper?
>
> Have you tried whisper_timestamped with option no_speech_threshold = 0?
>
> You can also play with options compression_ratio_threshold and logprob_threshold (lowering those thresholds).
>
> And if the previous did not work, also try temperature = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0), best_of = 5.

@Jeronymous Where should I put these options in whisper_timestamped? Is it like this?

options = {
"no_speech_threshold":0,
}
model = whisper.load_model("base", device="cpu", no_speech_threshold=0,options)

I tried

  1. import whisper instead of import whisper_timestamped as whisper
     1.1 It returns all the text and works fine, but I want timestamps for each word.
  2. I have no idea what this means:
     > You can also play with options compression_ratio_threshold and logprob_threshold (lowering those thresholds).

Jeronymous commented 1 year ago

OK let me clarify. You can add options here:

result = whisper.transcribe(model, audio, language="en")

If whisper works fine, then this should work fine:

result = whisper.transcribe(model, audio, language="en",
   beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
)

And if you are concerned about processing time (you want things to run fast), you can try:

result = whisper.transcribe(model, audio, language="en",
   no_speech_threshold = 0
)

Note: I see you are using the "base" model, but the performance of this model is not good (about twice as many transcription errors as the "small" model). So if you can afford more computation time / memory usage, I recommend that you use the "small" model, if not the "medium" one. It of course depends on the accuracy you are expecting.
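
Since the goal is a timestamp for each word, here is a rough sketch of how the options above could be bundled and how the words could be read back from the result. It assumes that each segment of the result has a "words" list whose entries carry "text", "start" and "end". `ACCURATE_OPTIONS`, `FAST_OPTIONS`, `iter_words` and `transcribe_words` are hypothetical names introduced for illustration only.

```python
# The two option sets suggested in this thread, ready to pass as **kwargs.
ACCURATE_OPTIONS = {
    "beam_size": 5,
    "best_of": 5,
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
}
FAST_OPTIONS = {"no_speech_threshold": 0}


def iter_words(result):
    """Yield (text, start, end) for every word in a transcription result.

    Assumption: segments carry a "words" list with "text", "start", "end".
    """
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            yield word["text"], word["start"], word["end"]


def transcribe_words(path, model_name="small", **options):
    """Transcribe `path` and return the list of (text, start, end) tuples."""
    # Imported lazily so the helpers above can be used without the package.
    import whisper_timestamped as whisper

    audio = whisper.load_audio(path)
    model = whisper.load_model(model_name, device="cpu")
    result = whisper.transcribe(model, audio, language="en", **options)
    return list(iter_words(result))
```

For example, `transcribe_words("my_audio_file.wav", **ACCURATE_OPTIONS)` would run the slower but more robust setting, and `**FAST_OPTIONS` the quicker one.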

aliscie commented 1 year ago

@Jeronymous What about using "large" in model = whisper.load_model("large", device="cpu") instead of using "base"? Would that help?

Jeronymous commented 1 year ago

The transcription will certainly be better, and the computation time will also be higher... You're the best judge depending on your application. Just give it a try and see.
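
To "give it a try" in a measurable way, something like the following could be used to time each model on the same file. This is only a sketch: `time_call` and `compare_models` are hypothetical helpers, the list of model names is just an example, and actual timings depend entirely on the machine.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (return_value, elapsed_seconds)."""
    start = time.perf_counter()
    value = fn(*args, **kwargs)
    return value, time.perf_counter() - start


def compare_models(path, model_names=("base", "small", "large")):
    """Return {model_name: (seconds, word_count)} for the same audio file."""
    # Imported lazily so time_call can be used without the package installed.
    import whisper_timestamped as whisper

    audio = whisper.load_audio(path)
    timings = {}
    for name in model_names:
        model = whisper.load_model(name, device="cpu")
        result, seconds = time_call(
            whisper.transcribe, model, audio, language="en"
        )
        # Word count of the full text as a crude check that nothing was dropped
        timings[name] = (seconds, len(result["text"].split()))
    return timings
```

Calling `compare_models("my_audio_file.wav")` then shows, for each model, how long the transcription took and how many words came out, which makes the accuracy / speed trade-off easier to judge.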