thewh1teagle / vibe

Transcribe on your own!
https://thewh1teagle.github.io/vibe/
MIT License
1.09k stars 68 forks

[Feature Request]: Need to support distil-whisper-large-v3 #260

Closed martjay closed 1 month ago

martjay commented 2 months ago

Describe the feature

This model is 6x faster than the regular ggml Whisper model.

thewh1teagle commented 2 months ago

Thanks for your interest in improving Vibe! Distilled models should already be supported and listed on the models page. I'm not sure why the reports suggest they're 6x faster; in my tests, the difference isn't even 2x. The medium model appears to run at about the same speed as ggml-medium.

martjay commented 2 months ago

I don't know. I extracted subtitles from a 1-hour video, and the processing time dropped from 12 minutes with the Whisper V2 ggml model to 2 minutes. I used this GUI version: https://github.com/CheshireCC/faster-whisper-GUI

martjay commented 2 months ago

While running, its VRAM usage is only a little over 3 GB.

thewh1teagle commented 2 months ago

Interesting. What was the size of the model? Medium?

Can you try this with vibe?

https://huggingface.co/distil-whisper/distil-large-v3-ggml/resolve/main/ggml-distil-large-v3.bin

martjay commented 2 months ago

There is no support for detecting models in subdirectories.

I was wrong. I was actually using models--Systran--faster-distil-whisper-large-v3.

thewh1teagle commented 2 months ago

You can copy the model file to Vibe's models directory.
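For anyone who prefers to script this step, here is a minimal sketch of copying a downloaded model file into a models directory. The function name `install_model` and the `models_dir` argument are hypothetical; the actual location of Vibe's models directory is not stated in this thread, so it has to be supplied by the caller.

```python
import shutil
from pathlib import Path


def install_model(model_file: Path, models_dir: Path) -> Path:
    """Copy a model file into a models directory and return the new path.

    `models_dir` is an assumption: pass whatever directory your Vibe
    installation actually reads models from.
    """
    if not model_file.is_file():
        raise FileNotFoundError(f"model file not found: {model_file}")
    # Create the target directory if it does not exist yet.
    models_dir.mkdir(parents=True, exist_ok=True)
    destination = models_dir / model_file.name
    # copy2 preserves file metadata such as the modification time.
    shutil.copy2(model_file, destination)
    return destination
```

Usage would look something like `install_model(Path("ggml-distil-large-v3.bin"), Path("<your Vibe models directory>"))`.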

martjay commented 2 months ago

I just downloaded and used ggml-distil-large-v3.bin. Recognition of a 16-minute video took less than one minute.

martjay commented 2 months ago

With ggml-large-v3 it takes 8-9 minutes.
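To put the timings reported in this thread side by side, here is a small sketch of the arithmetic. The helper names `speedup` and `real_time_factor` are made up for illustration, and the inputs are only the rough figures quoted above, not measured benchmarks.

```python
def speedup(baseline_minutes: float, new_minutes: float) -> float:
    """How many times faster the new run is than the baseline run."""
    if baseline_minutes <= 0 or new_minutes <= 0:
        raise ValueError("durations must be positive")
    return baseline_minutes / new_minutes


def real_time_factor(audio_minutes: float, processing_minutes: float) -> float:
    """Minutes of audio processed per minute of wall-clock time."""
    if audio_minutes <= 0 or processing_minutes <= 0:
        raise ValueError("durations must be positive")
    return audio_minutes / processing_minutes


# 1-hour video: 12 minutes with the Whisper V2 ggml model vs 2 minutes.
print(speedup(12, 2))           # 6.0
# 16-minute video: 8 minutes (low end of 8-9) vs at most 1 minute,
# so this is a lower bound on the speedup.
print(speedup(8, 1))            # 8.0
# 16 minutes of audio in at most 1 minute of processing.
print(real_time_factor(16, 1))  # 16.0
```

Since the distilled run took less than one minute, the last two figures are lower bounds rather than exact values.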

thewh1teagle commented 2 months ago

Thanks for sharing. I'm really surprised. And how does it compare to the default model that comes with Vibe, ggml-medium.bin?

martjay commented 2 months ago

If I can use the large model, why should I use the medium model?

There is another issue with Vibe: choosing the subtitle file type. It should be selected before starting; otherwise the subtitles are not segmented properly and come out as single long lines.

martjay commented 2 months ago

Is there no support for .safetensors, just .bin?

thewh1teagle commented 2 months ago

> There is another issue with Vibe: choosing the subtitle file type. It should be selected before starting; otherwise the subtitles are not segmented properly and come out as single long lines.

You mean the length of each line, right? Maybe we can add 'preset' options in the 'More options' section. That's a good idea.

thewh1teagle commented 2 months ago

> Is there no support for .safetensors, just .bin?

Only ggml/gguf/bin is supported. You can easily convert safetensors by following https://github.com/thewh1teagle/vibe/blob/main/docs/MODELS.md#prepare-your-own-models

martjay commented 2 months ago

> > There is another issue with Vibe: choosing the subtitle file type. It should be selected before starting; otherwise the subtitles are not segmented properly and come out as single long lines.

> You mean the length of each line, right? Maybe we can add 'preset' options in the 'More options' section. That's a good idea.

What I mean is, you should choose a file type, such as SRT/VTT/TXT, before recognizing speech. Do you understand?

thewh1teagle commented 1 month ago

Yes, I understand. I know the issue you're talking about: when creating an SRT, the lines are too long. We may add a way to choose presets in 'More options'. Can you open a new issue about it? Thanks :)

martjay commented 1 month ago

Hello! Actually, I have many thoughts. Is it necessary to start a new thread to express them? Look at this. I set the length of each subtitle, but a very long paragraph still appears. This is what I wanted to say.

[Screenshot: Snipaste_2024-09-10_22-05-26]

In fact, distil-whisper is very well suited to real-time speech recognition because it is really fast. If you can add this function along with translated subtitles, there would be no language that cannot be understood, and everything would happen in real time. However, if you are interested in this function, it is still necessary to keep each subtitle segment relatively short and to split subtitles at certain punctuation marks or by character count. Right now it still seems to me that there are some problems with this.
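As an illustration of the kind of splitting described here, this is a rough sketch that breaks text at sentence-ending punctuation first and then wraps anything still too long on word boundaries. It is not how Vibe segments subtitles; the function name `split_subtitle`, the `max_chars` parameter, and its default of 42 are all assumptions made for this example.

```python
import re


def split_subtitle(text: str, max_chars: int = 42) -> list[str]:
    """Split text into short subtitle segments.

    Breaks after sentence-ending punctuation, then wraps each sentence on
    word boundaries so that segments stay within max_chars. A single word
    longer than max_chars is kept whole rather than cut.
    """
    # Split after ., !, ? (and their full-width forms), keeping the mark
    # attached to the preceding sentence.
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?。！？])\s*", text)
        if s.strip()
    ]
    segments: list[str] = []
    for sentence in sentences:
        current = ""
        for word in sentence.split():
            candidate = f"{current} {word}".strip()
            if len(candidate) <= max_chars or not current:
                current = candidate
            else:
                # Adding this word would exceed the limit: close the segment.
                segments.append(current)
                current = word
        if current:
            segments.append(current)
    return segments
```

For example, `split_subtitle("Hello world. This is a longer sentence that must wrap.", max_chars=20)` returns `["Hello world.", "This is a longer", "sentence that must", "wrap."]`. Known limitations of this sketch: it also splits at periods inside numbers or abbreviations, and text without spaces (such as Chinese or Japanese) is only split at punctuation.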