Open thewh1teagle opened 3 weeks ago
@thewh1teagle Hi, as far as I can see in whisper.cpp, `split_on_word` works with `max_len` and not with the `max_tokens` parameter. Also, they implicitly enable `token_timestamps` when `max_len > 0`:
wparams.token_timestamps = params.output_wts || params.output_jsn_full || params.max_len > 0;
wparams.max_len = params.output_wts && params.max_len == 0 ? 60 : params.max_len;
wparams.split_on_word = params.split_on_word;
and later `split_on_word` takes effect only when `token_timestamps == true` and `max_len > 0`. So try enabling the `token_timestamps` and `split_on_word` flags and setting `max_len` to the desired maximum segment length in characters. Hope it helps.
@arizhih Thanks! I'm looking to split it per word so users can easily select the max words per sentence. It's useful for creating video captions, where the width of the screen is limited. Splitting per letter is harder and less accurate. Is there a way to achieve it through word splitting?
I have another idea: I can enable token timestamps and take as many words as I want. However, it may be less accurate and may split in the middle of a sentence. Does whisper.cpp split sentences smarter by default?
By default, whisper produces from 1 to N segments of different lengths.
When you set `token_timestamps` and `max_len`, whisper splits large segments into multiple segments, each no longer than `max_len`. If you add `split_on_word`, each segment may be a little longer (it runs to the end of the last word).
It doesn't affect how it produces sentences at all, just how it returns segments.
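To make the behaviour described above concrete, here is a minimal, self-contained sketch of length-based wrapping. It is a simplified model and not whisper.cpp's actual code: the function name `wrap_segment` is hypothetical, lengths are counted in characters, and a leading space on a token is assumed to mark the start of a word.

```rust
/// Simplified model of splitting one long segment by `max_len`.
/// `tokens` are sub-word pieces; a leading space marks the start of a word.
/// This only illustrates the idea; it is not whisper.cpp's implementation.
fn wrap_segment(tokens: &[&str], max_len: usize, split_on_word: bool) -> Vec<String> {
    let mut out = Vec::new();
    let mut cur = String::new();
    for tok in tokens {
        let would_overflow = cur.chars().count() + tok.chars().count() > max_len;
        let at_word_start = tok.starts_with(' ');
        // Split only when the limit is exceeded and, when `split_on_word`
        // is set, only if the next token begins a new word.
        if !cur.is_empty() && would_overflow && (!split_on_word || at_word_start) {
            out.push(cur.trim().to_string());
            cur = String::new();
        }
        cur.push_str(tok);
    }
    if !cur.trim().is_empty() {
        out.push(cur.trim().to_string());
    }
    out
}
```

With the tokens `[" Hello", " wor", "ld", " this", " is", " whis", "per"]` and `max_len = 10`, the model cuts in the middle of "world" when `split_on_word` is off, and lets the first segment grow slightly past the limit when it is on. With `max_len = 1` and `split_on_word` on, every word ends up in its own segment.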
Thanks, so I understand that it's not the right way to produce max words per sentence. I thought about a simpler way: get token timestamps from whisper and then build the sentences the way I want, with max words per sentence.
However, when using token timestamps it produces incorrect tokens, or at least they look incorrect, since it counts symbols as single tokens.
Created with Vibe app.
or at least it looks incorrect since it count symbols as single tokens.
It's because a token is not a word. Whisper has about 52000 tokens, and all words are built from these tokens.
Maybe if you set `max_len` to 1 and enable the `split_on_word` option, it will produce one segment for each word.
Same
params.set_token_timestamps(true);
params.set_split_on_word(true);
params.set_max_len(1);
Maybe I have a mistake in how I consume the segments. That's how I create the word segments:
Yes, you get tokens, but you need to get segment text. Try to use this
let text = state.full_get_segment_text_lossy(s).context("failed to get segment")?;
Notice that I said word segments. In general I already use get_segment_text there, in the else statement. Do I need to use get_segment_text even in the loop over num_tokens?
My proposal was to use `max_len` 1 and `split_on_word`; I think that with these options each segment will be a single word.
So you don't need to use tokens at all, only segments.
for s in 0..num_segments {
    let text = state.full_get_segment_text_lossy(s).context("failed to get segment")?;
    let start = state.full_get_segment_t0(s).context("failed to get start timestamp")?;
    let stop = state.full_get_segment_t1(s).context("failed to get end timestamp")?;
    segments.push(Segment { text, start, stop });
}
If this doesn't help, tomorrow I'll give you an example of how to create words from tokens.
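For reference, building words from tokens could look roughly like the sketch below. It is only an illustration under stated assumptions: the `Token` and `Word` structs and the `tokens_to_words` function are hypothetical, a token is assumed to begin a new word when its text starts with a space, and special tokens are assumed to be rendered like `[_BEG_]`. The token text and timestamps would have to be read from the whisper state first.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Token {
    text: String,
    t0: i64, // token start timestamp
    t1: i64, // token end timestamp
}

#[derive(Debug, Clone, PartialEq)]
struct Word {
    text: String,
    start: i64,
    stop: i64,
}

/// Merge sub-word tokens into whole words.
/// Assumption: a leading space marks the start of a new word, and
/// punctuation tokens (no leading space) attach to the previous word.
fn tokens_to_words(tokens: &[Token]) -> Vec<Word> {
    let mut words: Vec<Word> = Vec::new();
    for tok in tokens {
        // Assumption: special tokens look like "[_BEG_]" and carry no text.
        if tok.text.starts_with("[_") {
            continue;
        }
        if !tok.text.starts_with(' ') {
            // Continuation of the current word: append and extend its end time.
            if let Some(last) = words.last_mut() {
                last.text.push_str(&tok.text);
                last.stop = tok.t1;
                continue;
            }
        }
        let text = tok.text.trim_start().to_string();
        if text.is_empty() {
            continue;
        }
        words.push(Word { text, start: tok.t0, stop: tok.t1 });
    }
    words
}
```

This also explains why symbols show up as separate tokens: a comma or period is its own token, and it only becomes part of a word once it is merged with the token before it.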
@arizhih
It worked! I tried so many options there but didn't think about this one.
Thank you so much :)
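Once each segment is a single word, the original goal (a maximum number of words per caption) can be handled in plain Rust after transcription. The sketch below is one possible approach, not something whisper.cpp or whisper-rs provides: `group_words` and `merge` are hypothetical helpers, `Segment` mirrors the struct used in the loop above with `i64` timestamps assumed, and closing a caption early at `.`, `?` or `!` is an added heuristic to avoid running across sentence ends.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    text: String,
    start: i64,
    stop: i64,
}

/// Join a non-empty run of one-word segments into a single caption.
fn merge(words: &[&Segment]) -> Segment {
    Segment {
        text: words.iter().map(|w| w.text.trim()).collect::<Vec<_>>().join(" "),
        start: words[0].start,
        stop: words[words.len() - 1].stop,
    }
}

/// Group one-word segments into captions of at most `max_words` words.
/// A caption is also closed early when a word ends a sentence.
fn group_words(words: &[Segment], max_words: usize) -> Vec<Segment> {
    let max_words = max_words.max(1); // treat 0 as 1 to avoid empty captions
    let mut captions = Vec::new();
    let mut current: Vec<&Segment> = Vec::new();
    for word in words {
        current.push(word);
        let ends_sentence = word
            .text
            .trim_end()
            .chars()
            .last()
            .map_or(false, |c| matches!(c, '.' | '?' | '!'));
        if current.len() >= max_words || ends_sentence {
            captions.push(merge(&current));
            current.clear();
        }
    }
    if !current.is_empty() {
        captions.push(merge(&current));
    }
    captions
}
```

Calling `group_words(&segments, 3)` on the per-word segments collected above would yield captions of up to three words each, with start and stop times taken from the first and last word in the group.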
I'm trying to enable `set_max_tokens` along with `set_split_on_word` to provide a way to set max words per sentence, but when I set `split_on_word` to `true` and `max_tokens` to anything more than 0, the transcription finishes very fast but produces gibberish and only 2 sentences for long audio. In the original whisper.cpp CLI program it works as expected with max length per line.