Transcript generated on AMD CPU messed up

The3IC commented 1 month ago

Tried to transcribe a 14-minute audio file (English, non-native speaker) and got totally different results on AMD CPU and Intel GPU.

On the AMD, everything is fine for the first few minutes but then the process gets "stuck" and for the last 10+ minutes the same label content is duplicated into every label. The labels are "connected" to what I assume to be the right place in the audio, but the content (textual timestamp and text) just repeats.
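For anyone triaging a similar transcript, here is a minimal sketch for locating where the repetition starts. It assumes the transcript was exported as an SRT file; the helper names `parse_srt` and `first_repeat_run` and the sample captions are hypothetical and not part of the plugin.

```python
import re

def parse_srt(text):
    """Return a list of (index, timestamp, caption) tuples from SRT text."""
    entries = []
    # SRT blocks are separated by blank lines: index, timestamp line, caption lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip malformed blocks
        entries.append((lines[0], lines[1], " ".join(lines[2:])))
    return entries

def first_repeat_run(entries, min_run=3):
    """Return the position of the first run of >= min_run identical captions, or None."""
    run_start = 0
    for i in range(1, len(entries) + 1):
        # A run ends at the end of the list or when the caption text changes.
        if i == len(entries) or entries[i][2] != entries[run_start][2]:
            if i - run_start >= min_run:
                return run_start
            run_start = i
    return None

# Made-up sample: two normal captions followed by three identical ones.
sample = """1
00:00:01,000 --> 00:00:02,000
Hello there.

2
00:00:02,000 --> 00:00:04,000
Second line.

3
00:00:04,000 --> 00:00:06,000
stuck text

4
00:00:06,000 --> 00:00:08,000
stuck text

5
00:00:08,000 --> 00:00:10,000
stuck text
"""

entries = parse_srt(sample)
start = first_repeat_run(entries)
if start is not None:
    print("repetition starts at caption", entries[start][0], "at", entries[start][1])
```

The `min_run` threshold is a guess; a speaker legitimately repeating a short phrase twice should not be flagged, which is why the default is three.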

CPU load is at about 60-65% and memory at 20/64 GB, so resource starvation does not seem to be the cause. The processor is a Ryzen 5 5600.

On Intel iron, it works as expected.

If somebody wants them, I can send the audio and SRTs, but I would prefer to do it somewhere other than in a post like this.

RyanMetcalfeInt8 commented 1 month ago

Hi @The3IC,

Which whisper model were you using for this issue, and is the behavior consistent across different whisper models (base, small, medium, etc.)?

Thanks, Ryan

The3IC commented 1 month ago

I used only V3 in both cases, do you want me to try some smaller model on AMD?

RyanMetcalfeInt8 commented 1 month ago

It might be worth a quick test to see if the issue is somewhat consistent using a smaller model -- but I'd have to think about what to do with that info 😆

Probably the more important question is this -- Do you find this behavior is tied to the audio sample that you are using, or if you were to substitute it for another sample of similar duration, does the issue go away?

The3IC commented 1 month ago

OK, so ran a few tests with the ailing audio:

base: about 1+ minute, generated transcript quality low, does not seem to obey "Max segment length" parameter, the "repeat" error did not occur

large-v2: about 10 minutes, transcript quality good but not as good as with large-V3 (done on intel), the "repeat" error did not occur

large-v3: about 24 minutes, best transcript quality on Intel iron but the "messed up" error is there so it is at least repeatable on my system.

So to answer your question, the problem only occurs for large-v3 (for this video).

ps: did a small video on my workflow together with Resolve V19: https://youtu.be/d3hLO_TmzIA

The3IC commented 1 month ago

Btw, my settings on the AMD machine are here:

[Screenshot: Setting for a]

The3IC commented 1 month ago

OK, one more status update before I hit the sack. Had to go back to the Intel iron to regenerate the transcript (I first saved it as txt and then deleted the output :-) ) and now the same thing happened also on the Intel iron!

It starts at a different point (the repeated text is different) but the effect is the same. Thoughts:

Last time I did it on Intel I forgot to add the "Initial prompt" content, so that would be the only difference on Intel between the good run and the messed-up one. See post above.
From the screen grab you can see that the repeat seems to start at a rather long pause. Does this somehow trigger the error?

[Screenshot: Intel iron]

So running on large-V3 also here.

The3IC commented 1 month ago

One more data point, this time Intel hw, large-V3, CPU, no initial prompt, segment length 60. End of transcript again messed up, this time starting at a different point.

[Screenshot: CPU no initial]

RyanMetcalfeInt8 commented 1 month ago

Hmm -- segment length of 60 -- this is minutes or seconds? Edit: Oh sorry, segment length not duration -- disregard this question. What was the duration of this one?

Also, is this intermittent or repeatable? If you were to run the same exact case multiple times, does it produce consistent results?

The3IC commented 1 month ago

Documentation says segment length is in characters, and that seems to be the effect I get. It's still the same audio sample, just shy of 16 minutes to be exact. And seems it is repeatable as in different results based on parameter combo (CPU/GPU, Advanced settings).

RyanMetcalfeInt8 commented 1 month ago

> And seems it is repeatable as in different results based on parameter combo (CPU/GPU, Advanced settings).

So just to be 100% sure I understand -- Given a single 'failing' configuration, it gives the same exact results if you were to try it 2 or 3 times?

The3IC commented 1 month ago

Yes, within the testing that I have done (it takes half an hour to run one test so sample size is not huge).
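Since each run takes about half an hour, one cheap way to confirm that a given failing configuration is deterministic is to hash the exported transcript from each run and compare the digests. This is only a sketch under the assumption that the transcripts are available as text; the function names are hypothetical and the strings below are made up.

```python
import hashlib

def transcript_digest(text):
    """Hash a transcript after normalising line endings and trailing whitespace."""
    normalised = "\n".join(line.rstrip() for line in text.splitlines())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def runs_are_identical(transcripts):
    """True when every run produced the same normalised transcript."""
    return len({transcript_digest(t) for t in transcripts}) <= 1

# Made-up example: the first two runs differ only in line endings and padding.
run_a = "first line\r\nsecond line"
run_b = "first line\nsecond line  "
run_c = "first line\nsecond line\nsecond line"

same_ab = runs_are_identical([run_a, run_b])
same_abc = runs_are_identical([run_a, run_b, run_c])
print(same_ab, same_abc)
```

Identical digests across two or three runs of one configuration would suggest the failure is deterministic for that configuration rather than intermittent.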

RyanMetcalfeInt8 commented 1 month ago

Thanks, I've downloaded and deleted your comment.

RyanMetcalfeInt8 commented 1 month ago

> Yes, within the testing that I have done (it takes half an hour to run one test so sample size is not huge).

Got it, thanks -- yeah no problem, figured 16 mins takes quite a while with large-v3.

RyanMetcalfeInt8 commented 1 month ago

By the way, if you're on Discord, you can find me on the Audacity channel (invite: https://discord.gg/dFudcMEW), and you can always PM me stuff you don't want to post here.

The3IC commented 1 month ago

Just for fun, I updated this video https://youtu.be/d3hLO_TmzIA with

English - unedited transcript generated with large-v2
English (Australian) - unedited transcript from Resolve
Youtube autogenerated

OpenVINO (when it works) wins, least editing work to get it all to be 100%

The3IC commented 1 month ago

@RyanMetcalfeInt8 I'm too old for Discord :-)

RyanMetcalfeInt8 commented 1 month ago

Okay, I was at least able to reproduce:

Basically, large-v3 with default settings (device=CPU). Okay, let me experiment a little bit in hopes of finding the root cause. Thanks!

The3IC commented 1 month ago

This seems to be essentially fixed in v3.5.1-R2.2. The only minor gripe is something that I encountered with large-v2: the last 3 lines/sentences are "triplicated".

[Screenshot: triplets]

Whisper 1.6 also significantly improved transcript generation times: the test audio (just under 16 min/English/large-v3/CPU) went from 24-25 minutes to just over 11 minutes!

Intel iGPU generation times now look (relatively) even worse, with the generation times for GPU (Intel on-chip) now being about 2x the "pure CPU" times. Guess this is mainly a whisper issue?
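To put those timings in perspective, here is a rough real-time-factor calculation (processing time divided by audio duration). The inputs are approximations of the figures above: 16 minutes of audio, 24.5 minutes before, 11 minutes after, and an iGPU estimate derived only from the "about 2x" remark, not from a separate measurement.

```python
def real_time_factor(processing_minutes, audio_minutes):
    """Processing time per minute of audio; below 1.0 means faster than real time."""
    return processing_minutes / audio_minutes

AUDIO_MIN = 16.0        # "just under 16 min" (approximation)
CPU_BEFORE_MIN = 24.5   # midpoint of "24-25 minutes" (approximation)
CPU_AFTER_MIN = 11.0    # "just over 11 minutes" (approximation)
IGPU_AFTER_MIN = 2 * CPU_AFTER_MIN  # assumption based on "about 2x" the CPU time

rtf_before = real_time_factor(CPU_BEFORE_MIN, AUDIO_MIN)
rtf_after = real_time_factor(CPU_AFTER_MIN, AUDIO_MIN)
rtf_igpu = real_time_factor(IGPU_AFTER_MIN, AUDIO_MIN)
speedup = CPU_BEFORE_MIN / CPU_AFTER_MIN

print(f"CPU before: {rtf_before:.2f}x real time")
print(f"CPU after:  {rtf_after:.2f}x real time")
print(f"iGPU after: {rtf_igpu:.2f}x real time (estimated)")
print(f"CPU speedup: {speedup:.2f}x")
```

With these rough inputs, CPU transcription moves from slower than real time to faster than real time, while the estimated iGPU figure stays above 1.0.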

RyanMetcalfeInt8 commented 1 month ago

Thanks for testing it, and confirming that the output looks (mostly) correct.

Glad to hear that the transcription times have improved. For iGPU vs. CPU, results may vary depending on the exact model / generation of processor. I would expect more recent generations (12th Gen Core and up) to show better iGPU performance.

Anyway, I'll close this one. Thanks again!

intel / openvino-plugins-ai-audacity

Transcript generated on AMD CPU messed up #187