thewh1teagle / vibe

Transcribe on your own!
https://thewh1teagle.github.io/vibe/
MIT License
945 stars, 56 forks

[Bug]: Can get stuck during musical sections. #170

Open MCCMikey opened 2 months ago

MCCMikey commented 2 months ago

What happened?

I supplied it a two-hour recording of a radio program.

For about 30 minutes of the recording it repeatedly output the line [Music] rather than transcribing the spoken words between tracks.

From about the 40-minute mark it resumed normal output, except for one point where it repeated the line

They're only living in a world where they can say goodbye.

about 40 times.

I have to say though that it does a remarkably good job with such varied content. If this can be fixed I plan to ask this program to periodically verify that our presenters are playing the sponsor messages that they are meant to on our community radio station. I'm definitely going to pay for this app via the support link. I've been looking for something like this for ages.

2020-09-24 Transcription.docx

Steps to reproduce

Feed it the audio file https://drive.google.com/file/d/1nqtLWZTUEJjvVWGDUJRQTRdyJNGbCESM/view?usp=sharing and ask it to transcribe.

What OS are you seeing the problem on?

Windows

Relevant log output

No response

thewh1teagle commented 2 months ago

> For about 30 minutes of the recording it repeatedly output the line [Music] rather than transcribing the spoken words between tracks.

I understand that you've encountered challenges transcribing audio with music and background noise. Unfortunately, the Whisper AI model isn't the best fit for this task, as discussed here.

I propose combining a VAD AI model (Voice Activity Detector) with a denoiser model (for noise filtering and speech enhancement). I'd love to hear what other developers think about this approach—please feel free to share your thoughts.

It's worth noting that this isn't a simple task, and I'm not aware of an existing solution for it anywhere, at least not a non-commercial one.

For the VAD, we can use the Silero VAD model with Sherpa-rs, and for speech enhancement, we can use DeepFilterNet.
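To make the VAD half of this idea concrete, here is a rough sketch of how per-frame speech probabilities could be turned into time ranges that are worth sending to Whisper. It is only an illustration, not how Vibe or Sherpa-rs work: the function name `speech_segments`, the `Segment` type, and all default values (32 ms frames, 0.5 threshold, gap and minimum-length limits) are assumptions made up for this example. The probabilities are assumed to come from a VAD model such as Silero, and the denoising and transcription steps are left out.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Segment:
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio


def speech_segments(
    probs: Sequence[float],
    frame_sec: float = 0.032,      # assumed frame length, not taken from any library
    threshold: float = 0.5,        # assumed speech/non-speech cutoff
    min_speech_sec: float = 0.25,  # drop segments shorter than this
    max_gap_sec: float = 0.3,      # bridge pauses up to this long
) -> List[Segment]:
    """Convert per-frame speech probabilities into merged speech segments."""
    raw: List[Segment] = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # speech begins at this frame
        elif p < threshold and start is not None:
            raw.append(Segment(start * frame_sec, i * frame_sec))
            start = None
    if start is not None:
        # Audio ended while speech was still active.
        raw.append(Segment(start * frame_sec, len(probs) * frame_sec))

    # Merge segments separated by short pauses so sentences are not split.
    merged: List[Segment] = []
    for seg in raw:
        if merged and seg.start - merged[-1].end <= max_gap_sec:
            merged[-1].end = seg.end
        else:
            merged.append(Segment(seg.start, seg.end))

    # Discard very short blips, which are more likely noise than speech.
    return [s for s in merged if s.end - s.start >= min_speech_sec]
```

With something like this, only the returned ranges would be transcribed, so long stretches of music would never reach Whisper. Whether a VAD reliably separates talking from singing in a radio recording is an open question, which is where the denoiser would come in.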

> I have to say though that it does a remarkably good job with such varied content. If this can be fixed I plan to ask this program to periodically verify that our presenters are playing the sponsor messages that they are meant to on our community radio station. I'm definitely going to pay for this app via the support link. I've been looking for something like this for ages.

I'm glad you liked it! Thank you very much for your support in improving Vibe. It's greatly appreciated!