thewh1teagle / vibe

Transcribe on your own!
https://thewh1teagle.github.io/vibe/
MIT License

[Bug]: App keeps repeating two same sentences after some time and till runtime end. #98

Closed faxotherapy closed 3 months ago

faxotherapy commented 4 months ago

What happened?

I am trying to transcribe an MP4 video lasting about 1 h 30 min. After 20 minutes of transcribing, the app keeps repeating the same sentence, precisely 686 times. After 46 minutes, it starts repeating another sentence, precisely 1401 times. It seems the app likes those two sentences very much. Unfortunately, the log file does not reflect this issue at all.

Is it OK to feed the app an MP4 video directly? Or should I extract the audio track and feed that into the app?
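
For anyone who wants to rule out the container as a factor, here is a minimal sketch of extracting the audio track with ffmpeg from Python. Nothing in this thread establishes that Vibe needs this step; it is only a way to test the hypothesis. The function names and file names are hypothetical, ffmpeg is assumed to be installed and on PATH, and 16 kHz mono 16-bit PCM is chosen because that is the input format whisper.cpp's own examples have traditionally used.

```python
import shutil
import subprocess
from pathlib import Path


def build_extract_command(video: Path, wav: Path) -> list[str]:
    """Build an ffmpeg command that writes the audio track as 16 kHz mono PCM WAV."""
    return [
        "ffmpeg",
        "-y",                  # overwrite the output file if it already exists
        "-i", str(video),      # input container, e.g. an MP4 file
        "-vn",                 # drop the video stream
        "-ac", "1",            # downmix to a single (mono) channel
        "-ar", "16000",        # resample to 16 kHz
        "-c:a", "pcm_s16le",   # encode as 16-bit little-endian PCM
        str(wav),
    ]


def extract_audio(video: Path, wav: Path) -> None:
    """Run ffmpeg; raises if ffmpeg is missing or exits with an error."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(build_extract_command(video, wav), check=True)
```

Calling `extract_audio(Path("talk.mp4"), Path("talk.wav"))` would produce a WAV file that can be fed to the app in place of the video, so the two transcripts can be compared.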

Context:

Example:

This one kept repeating 1401 times:

986
0:46:35,780 --> 0:46:37,780
After identifying and writing down your aesthetic impact,

987
0:46:37,780 --> 0:46:39,780
After identifying and writing down your aesthetic impact,

988
0:46:39,780 --> 0:46:41,780
After identifying and writing down your aesthetic impact,

989
0:46:41,780 --> 0:46:43,780
After identifying and writing down your aesthetic impact,

990
0:46:43,780 --> 0:46:45,780
After identifying and writing down your aesthetic impact,

991
0:46:45,780 --> 0:46:47,780
After identifying and writing down your aesthetic impact,

This other one kept repeating 686 times:

0:46:19,780 --> 0:46:21,780
You should never stop writing for more than three to four seconds at a time.

979
0:46:21,780 --> 0:46:23,780
You should never stop writing for more than three to four seconds at a time.

980
0:46:23,780 --> 0:46:25,780
You should never stop writing for more than three to four seconds at a time.

981
0:46:25,780 --> 0:46:27,780
You should never stop writing for more than three to four seconds at a time.

982
0:46:27,780 --> 0:46:29,780
You should never stop writing for more than three to four seconds at a time.

983
0:46:29,780 --> 0:46:31,780
You should never stop writing for more than three to four seconds at a time.

984
0:46:31,780 --> 0:46:33,780
You should never stop writing for more than three to four seconds at a time.

I verified: this is not the case at all in the video; the speaker keeps talking normally, using other sentences.

Thank you for sharing any suggestion.

Steps to reproduce

  1. Try transcribing a media file with a runtime of about 1 hour 30 minutes.
  2. Observe that sentences keep repeating.
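
To check step 2 without scrolling through the whole transcript, the repetition can be counted from the exported SRT text. The following is a rough sketch, not part of Vibe: the function names and the `min_run` threshold are hypothetical, and it assumes cues are separated by blank lines with timing lines in the form shown in the examples above.

```python
import re
from itertools import groupby

# An SRT timing line, e.g. "0:46:35,780 --> 0:46:37,780"
TIMING = re.compile(
    r"^\d+:\d{2}:\d{2}[,.]\d{3}\s+-->\s+\d+:\d{2}:\d{2}[,.]\d{3}"
)


def cue_texts(srt: str) -> list[str]:
    """Return the text of each cue, in order, ignoring indices and timings."""
    texts = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # The cue text is whatever follows the timing line in this block.
        for i, ln in enumerate(lines):
            if TIMING.match(ln):
                texts.append(" ".join(lines[i + 1:]))
                break
    return texts


def repeated_runs(srt: str, min_run: int = 3) -> list[tuple[str, int]]:
    """Return (text, count) for each run of at least min_run identical consecutive cues."""
    runs = []
    for text, group in groupby(cue_texts(srt)):
        count = sum(1 for _ in group)
        if text and count >= min_run:
            runs.append((text, count))
    return runs
```

Passing the contents of the exported file, for example `repeated_runs(open("talk.srt", encoding="utf-8").read())`, returns each stuck sentence together with the length of its run, which makes it easy to compare models or settings.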

What OS are you seeing the problem on?

Linux

Relevant log output

When selecting the video, this is what happens:

** (WebKitWebProcess:601589): WARNING **: 22:36:30.150: The GStreamer FDK AAC plugin is missing, AAC playback is unlikely to work.
WebKit wasn't able to find a WebVTT encoder. Subtitles handling will be degraded unless gst-plugins-bad is installed.
GStreamer element fakevideosink not found. Please install it

I guess I should run the transcription again with logging enabled, using RUST_LOG=debug vibe.

faxotherapy commented 4 months ago

I tried again, this time with the medium model (ggml-medium.bin), and it seems to work. Any reason why the larger model would not work? I have 24 GB of RAM. Thank you.

thewh1teagle commented 4 months ago

It seems that the issue might be with the model itself, as discussed in this GitHub issue.

A potential solution has been suggested in this comment. Currently, we don't have an option to pass max tokens to Whisper, but I can add it if needed.

Did you download the large model from the same link that opens from the settings?

thewh1teagle commented 4 months ago

I released a new version with an option to set the maximum context token length.

You can install and run it with:

cd /tmp
wget -q --show-progress https://github.com/thewh1teagle/vibe/releases/download/v2.0.1/vibe_2.0.1_amd64.deb
sudo apt install ./vibe_2.0.1*.deb --reinstall
RUST_LOG=debug vibe

Then choose the large model again and, in the advanced options in the main window, set the maximum context to 64 or 32 before transcribing.

faxotherapy commented 4 months ago

I downloaded the large model from https://huggingface.co/ggerganov/whisper.cpp/tree/main

I'm gonna try your latest release today with the larger model. But, wouldn't the medium version be enough instead of using the larger one? Thanks.

thewh1teagle commented 3 months ago

I believe the medium version is sufficient in most cases. That's why I set it as the default model in the app. Also, transcribing with the larger model takes more time.

faxotherapy commented 3 months ago

Hi, thanks for your reply. In fact, I'm gonna get rid of the large model, which gave very unsatisfactory results (repetitions) with a max context of either 32 or 64, although 32 produced far fewer repetitions than 64.

Sticking with the medium version is best, with or without a max context of, e.g., 32.