Requested speaker ID, timestamps, per-word links for a difficult 20-minute video--got nothing but speakers--no text

CuirPork commented 2 months ago

I downloaded the Vibe software, which then downloaded the OpenAI model. While it was downloading, I looked at the options, saw that there was an option to identify speakers, and clicked it. It then appeared to launch a new window with a message that it needed to download the extended library for the OpenAI model, so I left it alone.

Once it was done installing locally, I added a local file that was bodycam footage of a police officer interviewing a motorist and a bicyclist who had been involved in a collision.

It took quite a while before I finally saw "SPEAKER 1:" but no timestamp or text. A little more time passed and "SPEAKER 2:" appeared, again with no timestamp or text. About 2 hours later, Vibe claimed that it was done transcribing the 20-minute video. However, the only thing in the text file I saved was the alternating "SPEAKER 1:" and "SPEAKER 2:" labels. No text, no timestamps. Just the speaker separations.

I posted to Reddit and was asked to report that here. Hope this helps, lemme know if I can answer any questions. Thanks.

altunenes commented 2 months ago

Interesting. I have seen this kind of behavior when my audio file is not mono, is wrongly converted to mono, or is very noisy.
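In case it helps anyone rule out the mono question, here is a minimal sketch for checking the channel count of a file and writing a mono copy. It assumes the audio has already been extracted from the video to a 16-bit PCM WAV file; `downmix_to_mono` is a hypothetical helper name, not part of Vibe, and it says nothing about how Vibe converts audio internally.

```python
import array
import sys
import wave


def downmix_to_mono(src_path: str, dst_path: str) -> int:
    """Write a mono copy of a 16-bit PCM WAV file; return the source channel count."""
    with wave.open(src_path, "rb") as src:
        channels = src.getnchannels()
        if src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = src.getframerate()
        samples = array.array("h")
        samples.frombytes(src.readframes(src.getnframes()))

    # WAV sample data is little-endian; swap on big-endian machines.
    if sys.byteorder == "big":
        samples.byteswap()

    # Average the interleaved channel samples of each frame into one sample.
    mono = array.array(
        "h",
        (
            sum(samples[i:i + channels]) // channels
            for i in range(0, len(samples), channels)
        ),
    )
    if sys.byteorder == "big":
        mono.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(mono.tobytes())
    return channels
```

If the returned channel count is already 1, the file was mono to begin with and the channel layout is probably not the cause.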

danchank commented 2 months ago

+1, same behavior. I turned off the timestamp option and it transcribes perfectly. Using Windows 11 with an Nvidia card.

thigger commented 1 month ago

Same behaviour (Windows 11, AMD), solved by turning off word-level timestamps

thewh1teagle commented 1 month ago

Thank you all for writing!

The word-level timestamps weren't enabled by default, right? Did you enable them manually?

@CuirPork

Does disabling it fix the issue?

Can you share a link to the video / audio? You can share a YouTube link, or upload the file to Google Drive and share the link here.

danchank commented 1 month ago

Turning it off fixed it. Don't remember if it was enabled by default.

digiguru commented 1 month ago

It seems to be a combination of diarisation and word-level timestamps.

When I disable word-level timestamps and enable diarisation, I get output with diarisation.
When I enable word-level timestamps and disable diarisation, I get output with word-level timestamps.
When I enable them both, I get blank utterances with the speakers listed for each blank utterance.
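For anyone who wants to check a saved transcript for this symptom without reading through it by hand, a rough sketch follows. It assumes the saved text file marks each utterance with a label such as "SPEAKER 1:" and puts the text either on the same line or on the lines after it; that layout is a guess based on the description in this thread, and `blank_utterances` is a hypothetical helper, not something Vibe provides.

```python
import re

# Assumed label format, e.g. "SPEAKER 1:" or "SPEAKER_1:", optionally followed by text.
SPEAKER_LABEL = re.compile(r"^\s*SPEAKER[ _]?\d+:\s*(.*)$")


def blank_utterances(transcript: str) -> list[int]:
    """Return 1-based line numbers of speaker labels that have no text."""
    blanks = []
    pending = None  # line number of a label still waiting for text
    for number, line in enumerate(transcript.splitlines(), start=1):
        match = SPEAKER_LABEL.match(line)
        if match:
            # A new label arrived before the previous one received any text.
            if pending is not None:
                blanks.append(pending)
            pending = None if match.group(1).strip() else number
        elif line.strip() and pending is not None:
            # Non-empty text line following a label: the utterance is not blank.
            pending = None
    if pending is not None:
        blanks.append(pending)
    return blanks
```

Calling `blank_utterances(open("transcript.txt", encoding="utf-8").read())` on a saved file (the file name here is only an example) returns an empty list when every speaker label is followed by some text.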

I'm using an Apple M1 chip.

digiguru commented 1 month ago

I can also confirm that this bug appears even on small files.

thewh1teagle / vibe #232