Closed The3IC closed 1 month ago
Hi @The3IC,
Which whisper model were you using for this issue, and is the behavior consistent across different whisper models (base, small, medium, etc.)?
Thanks, Ryan
I used only V3 in both cases, do you want me to try some smaller model on AMD?
It might be worth a quick test to see if the issue is somewhat consistent using a smaller model -- but I'd have to think about what to do with that info 😆
Probably the more important question is this -- Do you find this behavior is tied to the audio sample that you are using, or if you were to substitute it for another sample of similar duration, does the issue go away?
OK, so ran a few tests with the ailing audio:
base: about 1+ minute, generated transcript quality low, does not seem to obey "Max segment length" parameter, the "repeat" error did not occur
large-v2: about 10 minutes, transcript quality good but not as good as with large-V3 (done on intel), the "repeat" error did not occur
large-v3: about 24 minutes, best transcript quality on Intel iron but the "messed up" error is there so it is at least repeatable on my system.
So to answer your question, the problem only occurs for large-v3 (for this video).
ps: did a small video on my workflow together with Resolve V19: https://youtu.be/d3hLO_TmzIA
Btw, my settings on the AMD machine are here:
OK, one more status update before I hit the sack. Had to go back to the Intel iron to regenerate the transcript (I first saved it as txt and then deleted the output :-) ) and now the same thing happened also on the Intel iron!
It starts at a different point (the repeated text is different) but the effect is the same. Thoughts:
So running on large-V3 also here.
One more data point, this time Intel hw, large-V3, CPU, no initial prompt, segment length 60. End of transcript again messed up, this time starting at a different point.
Hmm -- segment length of 60 -- this is minutes or seconds? Edit: Oh sorry, segment length not duration -- disregard this question. What was the duration of this one?
Also, is this intermittent or repeatable? If you were to run the same exact case multiple times, does it produce consistent results?
Documentation says segment length is characters and that seems to be the effect I get? It's still the same audio sample, just shy of 16 minutes to be exact. And seems it is repeatable as in different results based on parameter combo (CPU/GPU, Advanced settings).
So just to be 100% sure I understand -- Given a single 'failing' configuration, it gives the same exact results if you were to try it 2 or 3 times?
Yes, within the testing that I have done (it takes half an hour to run one test so sample size is not huge).
Thanks, I've downloaded and deleted your comment.
Got it, thanks -- yeah no problem, figured 16 mins takes quite a while with large-v3.
By the way, if you're on Discord, you can find me on Audacity channel (invite: https://discord.gg/dFudcMEW), and can always PM me stuff you don't want to post here.
Just for fun, I updated this video https://youtu.be/d3hLO_TmzIA with
OpenVINO (when it works) wins, least editing work to get it all to be 100%
@RyanMetcalfeInt8 I'm too old for Discord :-)
Okay, I was at least able to reproduce:
Basically, large-v3 with default settings (device=CPU). Okay let me experiment a little bit in hopes to root cause. Thanks!
This seems to be essentially fixed in v3.5.1-R2.2, the only minor gripe is something that I encountered in large-v2, the last 3 lines/sentences are "triplicated".
Whisper 1.6 also significantly improved transcript generation times: the test audio (just under 16 min, English, large-v3, CPU) went from 24-25 minutes to just over 11 minutes!
Intel iGPU generation times now look (relatively) even worse, with the generation times for GPU (intel on-chip) being now about 2x the "pure CPU" times. Guess this is mainly a whisper issue?
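For reference, a rough sketch of what those timings work out to. The inputs are approximations of the figures quoted above (16.0 for "just under 16 min", 24.5 as the midpoint of "24-25 minutes", 11.0 for "just over 11 minutes"), and the helper names are made up for illustration:

```python
def real_time_factor(processing_min: float, audio_min: float) -> float:
    """Processing time divided by audio duration (below 1.0 means faster than real time)."""
    return processing_min / audio_min


def speedup(before_min: float, after_min: float) -> float:
    """How many times faster the new run is compared to the old one."""
    return before_min / after_min


# Approximate values taken from the comment above (assumptions, not measurements).
AUDIO_MIN = 16.0    # "just under 16 min"
BEFORE_MIN = 24.5   # midpoint of "24-25 minutes"
AFTER_MIN = 11.0    # "just over 11 minutes"

if __name__ == "__main__":
    print(f"RTF before: {real_time_factor(BEFORE_MIN, AUDIO_MIN):.2f}")
    print(f"RTF after:  {real_time_factor(AFTER_MIN, AUDIO_MIN):.2f}")
    print(f"Speedup:    {speedup(BEFORE_MIN, AFTER_MIN):.2f}x")
```

With these inputs, large-v3 on CPU goes from slower than real time to faster than real time, a speedup of a bit over 2x.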
Thanks for testing it, and confirming that the output looks (mostly) correct.
Glad to hear that the transcription times have improved. For iGPU vs. CPU, it may vary on the exact model / generation of processors. I would expect more recent generations (12th gen core and up) to show better iGPU performance.
Anyway, I'll close this one. Thanks again!
Tried to transcribe a 14 minute audio (English, non-native speaker) and I get totally different results on AMD CPU and Intel GPU.
On the AMD, everything is fine for the first few minutes, but then the process gets "stuck" and the last 10+ minutes are just the same label content being duplicated into every label. The labels are "connected" to what I assume to be the right place on the audio, but the content (textual timestamp and text) just repeats.
CPU load at about 60-65% and memory 20/64 Gbyte, so resource starvation would not seem to be the cause. Processor is a Ryzen 5 5600.
On Intel iron, it works as expected.
If somebody wants it, I can send you the audio and srt's but would prefer to do it somewhere else than in a post like this?
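For anyone who wants to check their own output for the "stuck" symptom described above, here is a minimal sketch that scans an exported SRT file for runs of consecutive cues with identical text. It assumes a plain, standard SRT layout (index line, timestamp line, text, blank line); the function names and the `min_run` threshold are made up for illustration and are not part of Audacity or the OpenVINO plugin:

```python
import re
from itertools import groupby

# Matches an SRT timing line such as "00:01:02,500 --> 00:01:05,000".
TIMESTAMP = re.compile(
    r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_texts(srt: str) -> list[str]:
    """Return the text of each cue in an SRT document, in order."""
    texts = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        for i, ln in enumerate(lines):
            if TIMESTAMP.match(ln):
                # Everything after the timing line is the cue text.
                texts.append(" ".join(lines[i + 1:]))
                break
    return texts


def repeated_runs(texts: list[str], min_run: int = 3) -> list[tuple[int, int, str]]:
    """Return (start_index, run_length, text) for each run of at least
    min_run consecutive cues that carry exactly the same text."""
    runs = []
    pos = 0
    for text, group in groupby(texts):
        n = len(list(group))
        if text and n >= min_run:
            runs.append((pos, n, text))
        pos += n
    return runs
```

Usage would be something like `repeated_runs(srt_texts(open("transcript.srt", encoding="utf-8").read()))`, where `transcript.srt` is a placeholder file name. An empty list means no run of identical cues was found; it says nothing about transcript quality otherwise.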