The3IC opened 6 months ago
Hi @The3IC,
Good suggestions, thank you. Just to point out that we already support a max-characters-per-label setting via the 'Max Segment Length' advanced option. See (8) here: https://github.com/intel/openvino-plugins-ai-audacity/tree/main/doc/feature_doc/whisper_transcription#description-of-properties
In general, all of these features simply expose capabilities of the underlying whisper framework that we are using, whisper.cpp. If that framework can support the features you suggest, then we can easily add support for them at the Audacity plugin level.
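For reference, here's a minimal sketch (not the plugin's actual code) of how a max-segment-length option maps onto whisper.cpp's public API. The parameter and function names come from whisper.h; the model filename and the surrounding glue are just placeholders:

```cpp
#include <cstdio>
#include "whisper.h"

// Sketch only: pcm_f32 / n_samples stand in for audio already loaded by the host app.
void transcribe_with_max_len(const float * pcm_f32, int n_samples) {
    struct whisper_context * ctx = whisper_init_from_file("ggml-large-v3.bin");

    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.token_timestamps = true;  // per-token timestamps; max_len has no effect without them
    params.max_len          = 50;    // max characters per segment (0 = no limit)
    params.split_on_word    = true;  // prefer breaking at word boundaries rather than raw tokens

    if (whisper_full(ctx, params, pcm_f32, n_samples) == 0) {
        const int n = whisper_full_n_segments(ctx);
        for (int i = 0; i < n; ++i) {
            // Each segment ends up as one Audacity label / one caption.
            std::printf("[%lld -> %lld] %s\n",
                        (long long) whisper_full_get_segment_t0(ctx, i),  // units of 10 ms
                        (long long) whisper_full_get_segment_t1(ctx, i),
                        whisper_full_get_segment_text(ctx, i));
        }
    }
    whisper_free(ctx);
}
```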
Thanks, Ryan
Ah yes, this is a good step in the right direction. The location of the documentation, and the fact that the UI gives no hint as to what 'Max Segment Length' refers to, had me confused for a moment.
So really only some fine tuning is left. The character limit is mostly an adequate approach, but it has one issue (in my view): it often leaves "orphan" single-word captions when the last word of a sentence does not fit within the max character count and the algorithm breaks captions at punctuation (full stops).
These single words then tend to "whizz" by the viewer. Most other caption generators suffer from this same problem, so you are in good company. I don't know whether it is your plugin or the underlying framework that could do something here, but a solution where the last caption of a sentence is, as a general rule, 2-3 words instead of a single orphan would be nice. Of course there will always be edge cases.
Would it help if I add a feature request about this somewhere in Whisper-land?
ps: I have been testing the plugin and am quite impressed with the quality of the transcripts, and in particular the translation. I do videos in English and Finnish, and for Finnish this is a real treat: being able to easily generate both Finnish and English captions! I use Resolve for video editing and it supports neither Finnish captions nor translation, so this is a very nice addition. The general observation is that the smaller models (I mainly tested "base") are pretty much useless, but the large models give quite impressive results. Another observation is that e.g. the Resolve model struggles a bit when the English speaker is not a native but has an accent, and it seems that Whisper handles these situations better (with the large models). Btw, is there any functional difference between the different large models, or is V3 just the "pinnacle" with all the goodies?
pps: as I'm rambling on here, noise reduction seems to work with OpenCL/AMD, but transcript generation does not. I know who is paying your bills, but it would still be nice if OpenCL were also supported for transcripts/translation (CUDA, being totally proprietary, can be left alone :-) ).
Hi @The3IC,
These single words then tend to "whizz" by the viewer. Most other caption generators suffer from this same problem, so you are in good company. I don't know whether it is your plugin or the underlying framework that could do something here, but a solution where the last caption of a sentence is, as a general rule, 2-3 words instead of a single orphan would be nice. Of course there will always be edge cases.
Right, and I think the 'edge cases' are what would give me pause. I suppose it's not hard to add some kind of 'experimental' support for detecting and 'joining' single words to the previous (or next?) segment.
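For the sake of illustration, here's roughly what that kind of post-processing could look like, as a sketch over a hypothetical (start, end, text) segment list -- none of this exists in the plugin today:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical caption segment: times in seconds, text as produced by the transcriber.
struct Segment {
    double start;
    double end;
    std::string text;
};

static int count_words(const std::string & text) {
    std::istringstream iss(text);
    std::string word;
    int n = 0;
    while (iss >> word) ++n;
    return n;
}

// Merge any segment shorter than `min_words` into the previous one, so a sentence
// does not end in a single-word "orphan" caption. Edge cases (e.g. the very first
// segment being an orphan) are deliberately left alone here.
std::vector<Segment> merge_orphans(const std::vector<Segment> & in, int min_words = 2) {
    std::vector<Segment> out;
    for (const Segment & seg : in) {
        if (!out.empty() && count_words(seg.text) < min_words) {
            out.back().end  = seg.end;
            out.back().text += " " + seg.text;
        } else {
            out.push_back(seg);
        }
    }
    return out;
}
```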
Would it help if I add a feature request about this somewhere in Whisper-land?
My suggestion would be to ask the question / submit the request for such a feature, as a whisper.cpp discussion: https://github.com/ggerganov/whisper.cpp/discussions
It's a pretty active community, and you may inspire something to be implemented in that framework -- or get some guidance about what parameters we need to tweak, if it's something that is already supported somehow.
ps: I have been testing the plugin and am quite impressed with the quality of the transcripts, and in particular the translation. I do videos in English and Finnish, and for Finnish this is a real treat: being able to easily generate both Finnish and English captions! I use Resolve for video editing and it supports neither Finnish captions nor translation, so this is a very nice addition. The general observation is that the smaller models (I mainly tested "base") are pretty much useless, but the large models give quite impressive results. Another observation is that e.g. the Resolve model struggles a bit when the English speaker is not a native but has an accent, and it seems that Whisper handles these situations better (with the large models). Btw, is there any functional difference between the different large models, or is V3 just the "pinnacle" with all the goodies?
So just for context, large V1, V2, and V3 were all trained and released by OpenAI. From what I can tell, V2 is considered to provide a good leap in accuracy over V1. And it seems that there isn't any general consensus about whether V3 is better than V2 -- and it could be sort of language dependent. But, I am definitely not an expert here. It was easy enough to add support for all of them, so that's what I did :) Glad you are finding the large models useful for translation.
pps: as I'm rambling on here, noise reduction seems to work with OpenCL/AMD, but transcript generation does not. I know who is paying your bills, but it would still be nice if OpenCL were also supported for transcripts/translation (CUDA, being totally proprietary, can be left alone :-) ).
You're referring to AMD GPU support? Technically, OpenVINO's GPU plugin is implemented on top of OpenCL -- but depending on the model, it might pull in some Intel-specific extension or other, which is why you might observe failures on AMD cards (btw -- I'm pretty much making an educated guess here).
@RyanMetcalfeInt8 in case you want to chip in: https://github.com/ggerganov/whisper.cpp/discussions/2139
Single words are split into 2 labels; how can I prevent it?
Hi @dbaluk, what is the Max Segment Length set to in advanced options? Note that if it's set to 1 (i.e. for 'word level' transcription), this is an experimental feature from whisper.cpp -- I suppose it's possible that it could split a single word across 2 labels in cases where the word is made up of multiple tokens.
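If you want to see where those boundaries come from, you can dump the tokens behind each segment -- a quick sketch against whisper.cpp's API, assuming `ctx` already holds a finished transcription:

```cpp
#include <cstdio>
#include "whisper.h"

// Assumes `ctx` already holds the result of a whisper_full() call.
void dump_tokens(struct whisper_context * ctx) {
    const int n_segments = whisper_full_n_segments(ctx);
    for (int i = 0; i < n_segments; ++i) {
        std::printf("segment %d: \"%s\"\n", i, whisper_full_get_segment_text(ctx, i));
        const int n_tokens = whisper_full_n_tokens(ctx, i);
        for (int j = 0; j < n_tokens; ++j) {
            // A single word (especially in a language like Polish) is frequently made up
            // of several of these tokens, which is where sub-word splits can come from.
            std::printf("  token %d: \"%s\"\n", j, whisper_full_get_token_text(ctx, i, j));
        }
    }
}
```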
I set max segment length = 50 and it splits words at the end of some lines.
When I set it to 0, lines are long/very long, but it does not split single words; however, I then have to use a subtitle editor to automatically shorten (split) the lines in the correct way.
Setting max segment length = 1 splits all words into syllables and occasionally single letters. I transcribe Polish.
Do you know of any solution?
To be able to use the transcripts usefully for .srt file generation, it should be possible to control the length of the transcript labels, as each label will be turned into a single caption in videos etc. (see the sketch below for how labels map to captions).
Typical ways to control this include, e.g., a maximum number of characters per label.
In addition, it would be good if it were possible to control "orphan" words, so that the software does not generate single-word labels from the last word of a sentence.
The above should apply to both the generate and translate functions.
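To make the label-to-caption mapping concrete, here's a minimal sketch of turning (start, end, text) labels into .srt cues. The Label struct and write_srt function are illustrative only and are not taken from the plugin's code:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical label exported from the label track: times in milliseconds.
struct Label {
    long long start_ms;
    long long end_ms;
    std::string text;
};

// Format a timestamp in the SRT "HH:MM:SS,mmm" form.
static std::string srt_time(long long ms) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%02lld:%02lld:%02lld,%03lld",
                  ms / 3600000, ms / 60000 % 60, ms / 1000 % 60, ms % 1000);
    return buf;
}

// One SRT cue per label -- which is why label length directly determines caption length.
void write_srt(std::FILE * f, const std::vector<Label> & labels) {
    for (size_t i = 0; i < labels.size(); ++i) {
        std::fprintf(f, "%zu\n%s --> %s\n%s\n\n",
                     i + 1,
                     srt_time(labels[i].start_ms).c_str(),
                     srt_time(labels[i].end_ms).c_str(),
                     labels[i].text.c_str());
    }
}
```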