jrp2014 opened 6 months ago
Interesting... I've seen that behavior before in lower-quality models. Two questions:
I expect that the mp3 will be 16-bit.
The problem seems to be a feature of the underlying architecture: e.g. padding the input to 30-second chunks. The original paper offered some mitigations, but they were far from completely effective (e.g. using the results from the previous chunk as a prompt, fiddling with temperatures and beam search, etc.). WhisperX seems to do a better job, but needs components that are only x86/CUDA based.
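For context, the 30-second padding works roughly like this: Whisper models consume fixed 30 s windows of 16 kHz audio (480,000 samples), so shorter audio is zero-padded and longer audio is split to that length. A minimal sketch of that step (the function name and details are illustrative, not the actual mlx-whisper internals):

```python
# Sketch of Whisper-style chunk padding (illustrative, not the actual
# mlx-whisper implementation). Whisper consumes fixed 30 s windows:
# 16 kHz * 30 s = 480,000 samples.

SAMPLE_RATE = 16_000
CHUNK_LENGTH = 30  # seconds
N_SAMPLES = SAMPLE_RATE * CHUNK_LENGTH  # 480,000

def pad_or_trim(samples, length=N_SAMPLES):
    """Return exactly `length` samples: zero-pad short input, trim long input."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))

# A 5 s clip becomes 5 s of speech plus 25 s of zero padding -- silence the
# decoder still has to "explain", which is one place repetition can start.
clip = [0.1] * (SAMPLE_RATE * 5)
padded = pad_or_trim(clip)
```

The padding is why short or trailing chunks are disproportionately likely to trigger the failure loop: most of the window is silence.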
It seems ironic that AI effectiveness should rely on hand tuning that is input-specific. 😇
As most of the output seems very accurate, I can only suppose that the repetition is caused by some heuristic that says "if you cannot generate output, just repeat what you just produced". Reasons for not generating output could include silence, padding from the 30-second chunking, background noise, or ...
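For what it's worth, the reference Whisper implementation detects exactly this failure mode with a gzip compression-ratio check: highly repetitive text compresses unusually well, and decodes whose ratio exceeds a threshold (2.4 by default) are retried at a higher temperature. A self-contained sketch of that check:

```python
import zlib

def compression_ratio(text: str) -> float:
    """Bytes in / bytes out under zlib; repetitive text compresses well."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

# Default threshold used by the reference implementation's decode fallback.
COMPRESSION_RATIO_THRESHOLD = 2.4

normal = "Alice was beginning to get very tired of sitting by her sister."
stuck = "who in the world am I? " * 20  # a typical repetition loop

assert compression_ratio(normal) < COMPRESSION_RATIO_THRESHOLD
assert compression_ratio(stuck) > COMPRESSION_RATIO_THRESHOLD
```

So the looping output isn't undetected, exactly; the mitigation (temperature fallback) just doesn't always escape the loop.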
The repetition problem is a common issue with encoder-decoder models, though it usually becomes vanishingly rare in high-quality models. Indeed, it could be that edge-case inputs are more likely to trigger it.
> I expect that the mp3 will be 16-bit.
I meant the model parameters. The default model is fp16, which may be slightly worse. You could try an fp32 model (pass `fp16=False` to the `transcribe` function). Also, you could try a larger model (like Whisper `large`).
Thanks. I'm just trying a recording of a back and forth chat. Most of the transcription looks great, it's just these repetitions that are anomalous.
I've tried using this:
```python
import mlx_whisper

speech_file = "/Users/jrp/.cache/whisper/alice.mp3"
result = mlx_whisper.transcribe(
    speech_file,
    path_or_hf_repo="mlx-community/whisper-large-v3-mlx",
    verbose=False,
    fp16=False,
)

with open("result.txt", "w") as f:
    for segment in result["segments"]:
        print(segment["text"], file=f)
```
with fp16 and fp32 versions on the Alice chapter, to get started. Most of the differences (see attachment) seem to be just how the output is segmented (with the fp16 version (`<`) being preferable in most cases), but there are a couple of oddities, e.g.: diff.txt
```diff
65,71c65
< what is the reason for my being so different?
< I wonder if I've changed in the night.
< Let me think, was I the same when I got up this morning?
< I almost think I can remember feeling a little different. But if I'm not the same, the next question is,
<
<
< The next question is, who in the world am I?
---
> who in the world am I?
:
> and then I'll tell you my history,
> and you'll understand why it is that I hate cats and dogs.
280,281c296
< for the pool was getting quite crowded
< with birds and animals that had fallen into it.
---
> for the pool was getting quite crowded with birds and animals that had fallen into it.
289c304
< End of chapter two.
---
> Chapter 2
```
Blimey, the fp32 version is about half the speed of the fp16 one. It doesn't half exercise the fans on this 48 GB machine; the GPUs are at 100%...
Does transcription stream, or is it going to just increase memory demand?
Looking at the various Whisper offshoots (the original, lightning, kit., etc.), they all seem to suffer from the same problem, with various heuristics being added and subtracted.
... and the fp32 version also stutters / hallucinates for me.
This is a pity. Most of the output is remarkably good; it's just that chopping up the input, padding it, and stitching it back together seems to introduce errors.
~~There is an option that I don't think is currently implemented in the MLX example:~~
Edit: it is actually implemented and should be enabled by default.
```
condition_on_previous_text: bool
    if True, the previous output of the model is provided as a prompt for the next window;
    disabling may make the text inconsistent across windows, but the model becomes less prone to
    getting stuck in a failure loop, such as repetition looping or timestamps going out of sync.
```
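The trade-off that docstring describes is easy to see in a toy version of the sliding-window loop. The decoder below is a deliberately contrived stub (not mlx-whisper internals): with conditioning on, one bad window poisons the prompt for every window after it; with it off, each window starts from a clean slate.

```python
# Toy sliding-window loop illustrating condition_on_previous_text.
# decode() is a stub standing in for the real model, rigged so that a
# prompt ending in a repeated phrase keeps the decoder stuck -- the
# failure loop the real option guards against.

def decode(window: str, prompt: str) -> str:
    if prompt.upper().endswith("AM I?"):
        return "AM I?"  # stuck: keeps echoing the prompt's tail
    return window.upper()  # stand-in for "recognized text"

def transcribe_windows(windows, condition_on_previous_text=True):
    out, prompt = [], ""
    for w in windows:
        text = decode(w, prompt)
        out.append(text)
        # With conditioning on, this window's output seeds the next prompt;
        # with it off, the next window is decoded with an empty prompt.
        prompt = text if condition_on_previous_text else ""
    return out

windows = ["hello", "who in the world am i?", "next line", "more text"]
```

With `condition_on_previous_text=True` the last two windows both come out as `"AM I?"`; with it `False`, every window is transcribed independently and the loop never forms (at the cost of less cross-window consistency).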
Is there a solution for the repeated text yet?
I ran the troublesome file through AssemblyAI and it produced a great result. Whisper is still a bit better at recognising foreign names and capitalising the names of countries, etc., but these are fairly easy to fix manually.
I don't know what underlying model they use, or how they feed it, but it managed to chomp its way through 9 hours of MP3/WAV without much problem.
It's not completely deterministic: sometimes it determines that there is a sentence break, sometimes just a comma. It usually doesn't really matter.
Using mlx_whisper, I find that the output contains repeated phrases from time to time, enough to ruin the transcription, e.g.:
Maybe this is a feature of the underlying model?