jrp2014 opened 6 months ago
Interesting... I've seen that behavior before in lower-quality models. Two questions:
I expect that the mp3 will be 16-bit.
The problem seems to be a feature of the underlying architecture: e.g. padding the input to 30-second chunks. The original paper offered some mitigations, but they were far from completely effective (e.g. using the results from the previous chunk as a prompt, fiddling with temperatures and beam search, etc.). WhisperX seems to do a better job, but needs components that are only x86/CUDA based.
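For context, the 30-second padding works roughly like this: Whisper models consume fixed 30 s windows of 16 kHz audio (480,000 samples), so shorter audio is zero-padded and longer audio is split to that length. A minimal sketch of that step (the function name and details are illustrative, not the actual mlx-whisper internals):

```python
# Sketch of Whisper-style chunk padding (illustrative, not the actual
# mlx-whisper implementation). Whisper consumes fixed 30 s windows:
# 16 kHz * 30 s = 480,000 samples.

SAMPLE_RATE = 16_000
CHUNK_LENGTH = 30  # seconds
N_SAMPLES = SAMPLE_RATE * CHUNK_LENGTH  # 480,000

def pad_or_trim(samples, length=N_SAMPLES):
    """Return exactly `length` samples: zero-pad short input, trim long input."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))

# A 5 s clip becomes 5 s of speech plus 25 s of zero padding -- silence the
# decoder still has to "explain", which is one place repetition can start.
clip = [0.1] * (SAMPLE_RATE * 5)
padded = pad_or_trim(clip)
```

The padding is why short or trailing chunks are disproportionately likely to trigger the failure loop: most of the window is silence.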
It seems ironic that AI effectiveness should rely on hand tuning that is input-specific. 😇
As most of the output seems very accurate, I can only suppose that the repetition is caused by some heuristic that says "if you cannot generate output, just repeat what you just produced". Reasons for not generating output could include silence, padding from the 30-second chunking, background noise, or ...
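For what it's worth, the reference Whisper implementation detects exactly this failure mode with a gzip compression-ratio check: highly repetitive text compresses unusually well, and decodes whose ratio exceeds a threshold (2.4 by default) are retried at a higher temperature. A self-contained sketch of that check:

```python
import zlib

def compression_ratio(text: str) -> float:
    """Bytes in / bytes out under zlib; repetitive text compresses well."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

# Default threshold used by the reference implementation's decode fallback.
COMPRESSION_RATIO_THRESHOLD = 2.4

normal = "Alice was beginning to get very tired of sitting by her sister."
stuck = "who in the world am I? " * 20  # a typical repetition loop

assert compression_ratio(normal) < COMPRESSION_RATIO_THRESHOLD
assert compression_ratio(stuck) > COMPRESSION_RATIO_THRESHOLD
```

So the looping output isn't undetected, exactly; the mitigation (temperature fallback) just doesn't always escape the loop.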
The repetition problem is a common issue with encoder-decoder models, though it usually becomes vanishingly rare in high-quality models. Indeed, it could be that edge-case inputs are more likely to trigger it.
> I expect that the mp3 will be 16-bit.
I meant the model parameters. The default model is fp16, which may be slightly worse. You could try an fp32 model (pass `fp16=False` to the `transcribe` function). Also, you could try a larger model (like Whisper `large`).
Thanks. I'm just trying a recording of a back and forth chat. Most of the transcription looks great, it's just these repetitions that are anomalous.
I've tried using this:
```python
import mlx_whisper

speech_file = "/Users/jrp/.cache/whisper/alice.mp3"
result = mlx_whisper.transcribe(
    speech_file,
    path_or_hf_repo="mlx-community/whisper-large-v3-mlx",
    verbose=False,
    fp16=False,
)

with open("result.txt", "w") as f:
    for segment in result["segments"]:
        print(segment["text"], file=f)
```
with fp16 and fp32 versions on the Alice chapter, to get started. Most of the differences (see attachment) seem to be just how the output is segmented (with the fp16 version (`<`) being preferable in most cases), but there are a couple of oddities, e.g.: diff.txt
```diff
65,71c65
< what is the reason for my being so different?
< I wonder if I've changed in the night.
< Let me think, was I the same when I got up this morning?
< I almost think I can remember feeling a little different. But if I'm not the same, the next question is,
<
<
< The next question is, who in the world am I?
---
> who in the world am I?
:
> and then I'll tell you my history,
> and you'll understand why it is that I hate cats and dogs.
280,281c296
< for the pool was getting quite crowded
< with birds and animals that had fallen into it.
---
> for the pool was getting quite crowded with birds and animals that had fallen into it.
289c304
< End of chapter two.
---
> Chapter 2
```
Blimey, the fp32 version is about half the speed of the fp16 one. It doesn't half exercise the fans on this 48 GB machine; the GPUs are at 100%...
Does transcription stream, or is it going to just increase memory demand?
Looking at the various Whisper offshoots (the original, lightning, kit., etc.), they all seem to suffer from the same problem, with various heuristics being added and subtracted.
... and the fp32 version also stutters / hallucinates for me.
This is a pity. Most of the output is remarkably good; it's just that chopping up the input, padding it, and stitching it back together seems to introduce errors.
~~There is an option that I don't think is currently implemented in the MLX example:~~
Edit: it is actually implemented and should be enabled by default.
```
condition_on_previous_text: bool
    if True, the previous output of the model is provided as a prompt for the next window;
    disabling may make the text inconsistent across windows, but the model becomes less prone to
    getting stuck in a failure loop, such as repetition looping or timestamps going out of sync.
```
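The trade-off that docstring describes is easy to see in a toy version of the sliding-window loop. The decoder below is a deliberately contrived stub (not mlx-whisper internals): with conditioning on, one bad window poisons the prompt for every window after it; with it off, each window starts from a clean slate.

```python
# Toy sliding-window loop illustrating condition_on_previous_text.
# decode() is a stub standing in for the real model, rigged so that a
# prompt ending in a repeated phrase keeps the decoder stuck -- the
# failure loop the real option guards against.

def decode(window: str, prompt: str) -> str:
    if prompt.upper().endswith("AM I?"):
        return "AM I?"  # stuck: keeps echoing the prompt's tail
    return window.upper()  # stand-in for "recognized text"

def transcribe_windows(windows, condition_on_previous_text=True):
    out, prompt = [], ""
    for w in windows:
        text = decode(w, prompt)
        out.append(text)
        # With conditioning on, this window's output seeds the next prompt;
        # with it off, the next window is decoded with an empty prompt.
        prompt = text if condition_on_previous_text else ""
    return out

windows = ["hello", "who in the world am i?", "next line", "more text"]
```

With `condition_on_previous_text=True` the last two windows both come out as `"AM I?"`; with it `False`, every window is transcribed independently and the loop never forms (at the cost of less cross-window consistency).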
Is there a solution for the repeated text yet?
I ran the troublesome file through AssemblyAI and it produced a great result. Whisper is still a bit better at recognising foreign names and capitalising the names of countries, etc., but these are fairly easy to fix manually.
I don't know what underlying model they use, or how they feed it, but it managed to chomp its way through 9 hours of MP3/WAV without much problem.
It's not completely deterministic: sometimes it determines that there is a sentence break, sometimes just a comma. It usually doesn't really matter.
Using mlx_whisper, I find that the output contains repeated phrases from time to time, enough to ruin the transcription, e.g.:
Maybe this is a feature of the underlying model?