max tokens and split on word params doesn't work

thewh1teagle commented 3 weeks ago

I'm trying to use set_max_tokens together with set_split_on_word to limit the number of words per sentence. But when I set split_on_word to true and max_tokens to anything greater than 0, the transcription finishes very quickly, the output is gibberish, and a long audio file yields only 2 sentences.

In the original whisper.cpp CLI program this works as expected with a max length per line.

arizhih commented 3 days ago

@thewh1teagle Hi, as far as I can see, in whisper.cpp split_on_word works with the max_len parameter, not with max_tokens. The CLI also implicitly enables token_timestamps when max_len > 0:

wparams.token_timestamps = params.output_wts || params.output_jsn_full || params.max_len > 0;
wparams.max_len          = params.output_wts && params.max_len == 0 ? 60 : params.max_len;
wparams.split_on_word    = params.split_on_word;

Later, split_on_word only takes effect when token_timestamps == true and max_len > 0:

https://github.com/ggerganov/whisper.cpp/blob/bf4cb4abad4e35c74b387df034cc4ac7b22e5fe6/whisper.cpp#L6224

So try enabling the token_timestamps and split_on_word flags, and set max_len to the desired maximum segment length in characters. Hope it helps.
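The dependency between the three flags can be written down as a small predicate. This is only a sketch of the rule described in the comment above: SplitParams, wrapping_active, split_on_word_effective and with_cli_default are hypothetical names, not part of whisper.cpp or whisper-rs, and with_cli_default copies only the max_len part of the quoted CLI line.

```rust
/// Hypothetical mirror of the three parameters discussed above.
/// This is NOT the whisper-rs `FullParams` type.
#[derive(Debug, Clone, Copy, Default)]
pub struct SplitParams {
    pub token_timestamps: bool,
    pub max_len: i32,
    pub split_on_word: bool,
}

/// Segment wrapping runs only with token timestamps on and a positive max_len.
pub fn wrapping_active(p: &SplitParams) -> bool {
    p.token_timestamps && p.max_len > 0
}

/// split_on_word changes the output only when wrapping itself is active.
pub fn split_on_word_effective(p: &SplitParams) -> bool {
    p.split_on_word && wrapping_active(p)
}

/// Simplified version of the quoted CLI line: a positive max_len
/// switches token timestamps on. A library caller has to do this explicitly.
pub fn with_cli_default(mut p: SplitParams) -> SplitParams {
    p.token_timestamps = p.token_timestamps || p.max_len > 0;
    p
}
```

Under this reading, setting split_on_word and max_len through the library without also setting token_timestamps leaves split_on_word without effect, which is why the CLI and the library call can behave differently.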

thewh1teagle commented 2 days ago

@arizhih Thanks! I'm looking to split per word, so users can easily choose the maximum number of words per sentence. That's useful for video captions, where the width available on screen is limited. Splitting by characters is harder and less accurate. Is there a way to achieve this through word splitting?

I have another idea: I can enable token timestamps and take as many words as I want. However, that may be less accurate and may split in the middle of a sentence. Does whisper.cpp split sentences more intelligently by default?

arizhih commented 2 days ago

By default Whisper produces from 1 to N segments of varying length.

When you set token_timestamps and max_len, Whisper splits large segments into multiple segments, each no longer than max_len. If you add split_on_word, each segment can be a little longer (it runs to the end of the last word).

It doesn't affect how it produces sentences at all, just how it returns segments.
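The wrapping behaviour described above can be modelled in a few lines. This is a simplified sketch, not whisper.cpp's actual implementation: wrap_tokens is a hypothetical helper, lengths are counted in bytes, and it assumes that a token beginning with a space marks the start of a word.

```rust
/// Simplified model of wrapping one segment's tokens into shorter segments.
/// `tokens` are token strings; a leading space is assumed to mark a word start.
pub fn wrap_tokens(tokens: &[&str], max_len: usize, split_on_word: bool) -> Vec<String> {
    let mut out = Vec::new();
    let mut current = String::new();
    for tok in tokens {
        // Would appending this token push the segment past max_len (in bytes)?
        let would_overflow = current.len() + tok.len() > max_len;
        // With split_on_word, a break is only allowed right before a word start.
        let may_break_here = !split_on_word || tok.starts_with(' ');
        if would_overflow && !current.is_empty() && may_break_here {
            out.push(std::mem::take(&mut current));
        }
        current.push_str(tok);
    }
    if !current.is_empty() {
        out.push(current);
    }
    out
}
```

For the tokens " not", " wh", "ome", "ver", ".", " That", "'s" and max_len 8, this model returns " not wh", "omever.", " That's" without split_on_word, and " not whomever.", " That's" with it. With max_len 1 and split_on_word it returns one segment per word, with trailing punctuation attached to the word.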

thewh1teagle commented 1 day ago

> It doesn't affect how it produces sentences at all, just how it returns segments.

Thanks. So I understand that this isn't the right way to get a maximum number of words per sentence. I thought of a simpler way: get token timestamps from Whisper and then build the sentences myself, with the maximum words per sentence I want.

However, with token timestamps the tokens look incorrect, or at least they appear incorrect, since symbols are counted as separate tokens.

regular

```json
[
  { "start": 0, "stop": 520, "text": " It's whoever, not whomever. That's whomever. No whomever is never actually right." },
  { "start": 520, "stop": 934, "text": " Well sometimes it's right. Michael is right. It's a made-up word used to trick" },
  { "start": 934, "stop": 1418, "text": " students. No actually whomever is the formal version of the word. Obviously" },
  { "start": 1418, "stop": 1792, "text": " it's a real word, but I don't know when to use it correctly. Not a native speaker." },
  { "start": 1792, "stop": 2200, "text": " I know what's right, but I'm not gonna say because you're all jerks who didn't" },
  { "start": 2200, "stop": 2540, "text": " come see my band last night. Do you really know which one is correct? I don't know." },
  { "start": 2540, "stop": 2942, "text": " It's whom when it's the object of the sentence and who when is the subject. That" },
  { "start": 2942, "stop": 4942, "text": " sounds right. Well it sounds right but is it? How did Ryan use it as an object? As an object. Ryan used me as an object. How did he use it again? It was Ryan wanted Michael the subject to explain the computer system, the object, to whomever, meaning us, the indirect object, which is the correct usage of the word." }
]
```

token timestamps

```json [ { "start": 0, "stop": 14, "text": " It" }, { "start": 14, "stop": 28, "text": "'s" }, { "start": 28, "stop": 79, "text": " whoever" }, { "start": 93, "stop": 93, "text": "," }, { "start": 94, "stop": 115, "text": " not" }, { "start": 122, "stop": 129, "text": " wh" }, { "start": 129, "stop": 147, "text": "ome" }, { "start": 152, "stop": 173, "text": "ver" }, { "start": 173, "stop": 200, "text": "." }, { "start": 200, "stop": 223, "text": " That" }, { "start": 223, "stop": 233, "text": "'s" }, { "start": 234, "stop": 245, "text": " wh" }, { "start": 245, "stop": 262, "text": "ome" }, { "start": 262, "stop": 278, "text": "ver" }, { "start": 279, "stop": 298, "text": "." }, { "start": 304, "stop": 313, "text": " No" }, { "start": 313, "stop": 326, "text": " wh" }, { "start": 326, "stop": 345, "text": "ome" }, { "start": 345, "stop": 364, "text": "ver" }, { "start": 364, "stop": 365, "text": " is" }, { "start": 380, "stop": 410, "text": " never" }, { "start": 410, "stop": 463, "text": " actually" }, { "start": 463, "stop": 496, "text": " right" }, { "start": 496, "stop": 520, "text": "." }, { "start": 520, "stop": 544, "text": " Well" }, { "start": 544, "stop": 597, "text": " sometimes" }, { "start": 597, "stop": 609, "text": " it" }, { "start": 609, "stop": 615, "text": "'s" }, { "start": 623, "stop": 649, "text": " right" }, { "start": 649, "stop": 658, "text": "." }, { "start": 667, "stop": 706, "text": " Michael" }, { "start": 707, "stop": 718, "text": " is" }, { "start": 718, "stop": 741, "text": " right" }, { "start": 752, "stop": 765, "text": "." 
}, { "start": 765, "stop": 777, "text": " It" }, { "start": 777, "stop": 788, "text": "'s" }, { "start": 788, "stop": 794, "text": " a" }, { "start": 794, "stop": 818, "text": " made" }, { "start": 818, "stop": 819, "text": "-" }, { "start": 831, "stop": 834, "text": "up" }, { "start": 834, "stop": 855, "text": " word" }, { "start": 858, "stop": 879, "text": " used" }, { "start": 886, "stop": 894, "text": " to" }, { "start": 894, "stop": 931, "text": " trick" }, { "start": 936, "stop": 990, "text": " students" }, { "start": 990, "stop": 1008, "text": "." }, { "start": 1010, "stop": 1012, "text": " No" }, { "start": 1037, "stop": 1079, "text": " actually" }, { "start": 1095, "stop": 1095, "text": " wh" }, { "start": 1095, "stop": 1116, "text": "ome" }, { "start": 1132, "stop": 1137, "text": "ver" }, { "start": 1137, "stop": 1151, "text": " is" }, { "start": 1151, "stop": 1172, "text": " the" }, { "start": 1172, "stop": 1214, "text": " formal" }, { "start": 1214, "stop": 1263, "text": " version" }, { "start": 1263, "stop": 1277, "text": " of" }, { "start": 1277, "stop": 1298, "text": " the" }, { "start": 1298, "stop": 1326, "text": " word" }, { "start": 1326, "stop": 1347, "text": "." 
}, { "start": 1347, "stop": 1417, "text": " Obviously" }, { "start": 1418, "stop": 1428, "text": " it" }, { "start": 1428, "stop": 1435, "text": "'s" }, { "start": 1440, "stop": 1443, "text": " a" }, { "start": 1443, "stop": 1464, "text": " real" }, { "start": 1464, "stop": 1485, "text": " word" }, { "start": 1485, "stop": 1494, "text": "," }, { "start": 1494, "stop": 1505, "text": " but" }, { "start": 1509, "stop": 1512, "text": " I" }, { "start": 1522, "stop": 1530, "text": " don" }, { "start": 1530, "stop": 1538, "text": "'t" }, { "start": 1547, "stop": 1561, "text": " know" }, { "start": 1561, "stop": 1582, "text": " when" }, { "start": 1582, "stop": 1592, "text": " to" }, { "start": 1592, "stop": 1607, "text": " use" }, { "start": 1607, "stop": 1617, "text": " it" }, { "start": 1617, "stop": 1664, "text": " correctly" }, { "start": 1664, "stop": 1678, "text": "." }, { "start": 1678, "stop": 1694, "text": " Not" }, { "start": 1694, "stop": 1698, "text": " a" }, { "start": 1699, "stop": 1730, "text": " native" }, { "start": 1730, "stop": 1761, "text": " speaker" }, { "start": 1767, "stop": 1792, "text": "." 
}, { "start": 1792, "stop": 1798, "text": " I" }, { "start": 1800, "stop": 1823, "text": " know" }, { "start": 1823, "stop": 1848, "text": " what" }, { "start": 1848, "stop": 1860, "text": "'s" }, { "start": 1860, "stop": 1881, "text": " right" }, { "start": 1889, "stop": 1903, "text": "," }, { "start": 1904, "stop": 1910, "text": " but" }, { "start": 1923, "stop": 1927, "text": " I" }, { "start": 1927, "stop": 1939, "text": "'m" }, { "start": 1939, "stop": 1957, "text": " not" }, { "start": 1957, "stop": 1988, "text": " gonna" }, { "start": 1988, "stop": 2005, "text": " say" }, { "start": 2005, "stop": 2023, "text": " because" }, { "start": 2050, "stop": 2067, "text": " you" }, { "start": 2067, "stop": 2085, "text": "'re" }, { "start": 2085, "stop": 2103, "text": " all" }, { "start": 2103, "stop": 2120, "text": " jer" }, { "start": 2125, "stop": 2133, "text": "ks" }, { "start": 2133, "stop": 2148, "text": " who" }, { "start": 2157, "stop": 2175, "text": " didn" }, { "start": 2177, "stop": 2199, "text": "'t" }, { "start": 2206, "stop": 2218, "text": " come" }, { "start": 2218, "stop": 2231, "text": " see" }, { "start": 2231, "stop": 2240, "text": " my" }, { "start": 2240, "stop": 2258, "text": " band" }, { "start": 2258, "stop": 2276, "text": " last" }, { "start": 2276, "stop": 2293, "text": " night" }, { "start": 2301, "stop": 2312, "text": "." }, { "start": 2312, "stop": 2321, "text": " Do" }, { "start": 2321, "stop": 2334, "text": " you" }, { "start": 2334, "stop": 2361, "text": " really" }, { "start": 2361, "stop": 2379, "text": " know" }, { "start": 2379, "stop": 2402, "text": " which" }, { "start": 2402, "stop": 2411, "text": " one" }, { "start": 2417, "stop": 2424, "text": " is" }, { "start": 2424, "stop": 2456, "text": " correct" }, { "start": 2456, "stop": 2457, "text": "?" 
}, { "start": 2471, "stop": 2473, "text": " I" }, { "start": 2473, "stop": 2486, "text": " don" }, { "start": 2486, "stop": 2504, "text": "'t" }, { "start": 2504, "stop": 2507, "text": " know" }, { "start": 2524, "stop": 2540, "text": "." }, { "start": 2540, "stop": 2551, "text": " It" }, { "start": 2551, "stop": 2561, "text": "'s" }, { "start": 2574, "stop": 2584, "text": " whom" }, { "start": 2591, "stop": 2608, "text": " when" }, { "start": 2608, "stop": 2619, "text": " it" }, { "start": 2619, "stop": 2630, "text": "'s" }, { "start": 2630, "stop": 2647, "text": " the" }, { "start": 2647, "stop": 2682, "text": " object" }, { "start": 2682, "stop": 2693, "text": " of" }, { "start": 2693, "stop": 2710, "text": " the" }, { "start": 2710, "stop": 2756, "text": " sentence" }, { "start": 2756, "stop": 2773, "text": " and" }, { "start": 2773, "stop": 2790, "text": " who" }, { "start": 2790, "stop": 2813, "text": " when" }, { "start": 2813, "stop": 2824, "text": " is" }, { "start": 2824, "stop": 2841, "text": " the" }, { "start": 2841, "stop": 2879, "text": " subject" }, { "start": 2881, "stop": 2905, "text": "." }, { "start": 2917, "stop": 2942, "text": " That" }, { "start": 2942, "stop": 2969, "text": " sounds" }, { "start": 2969, "stop": 2992, "text": " right" }, { "start": 2997, "stop": 3005, "text": "." }, { "start": 3005, "stop": 3016, "text": " Well" }, { "start": 3026, "stop": 3032, "text": " it" }, { "start": 3032, "stop": 3059, "text": " sounds" }, { "start": 3059, "stop": 3082, "text": " right" }, { "start": 3082, "stop": 3095, "text": " but" }, { "start": 3095, "stop": 3103, "text": " is" }, { "start": 3104, "stop": 3113, "text": " it" }, { "start": 3113, "stop": 3126, "text": "?" 
}, { "start": 3126, "stop": 3139, "text": " How" }, { "start": 3139, "stop": 3152, "text": " did" }, { "start": 3152, "stop": 3170, "text": " Ryan" }, { "start": 3170, "stop": 3183, "text": " use" }, { "start": 3183, "stop": 3192, "text": " it" }, { "start": 3192, "stop": 3201, "text": " as" }, { "start": 3201, "stop": 3210, "text": " an" }, { "start": 3210, "stop": 3237, "text": " object" }, { "start": 3237, "stop": 3250, "text": "?" }, { "start": 3250, "stop": 3256, "text": " As" }, { "start": 3260, "stop": 3268, "text": " an" }, { "start": 3268, "stop": 3295, "text": " object" }, { "start": 3295, "stop": 3320, "text": "." }, { "start": 3335, "stop": 3358, "text": " Ryan" }, { "start": 3358, "stop": 3392, "text": " used" }, { "start": 3392, "stop": 3409, "text": " me" }, { "start": 3409, "stop": 3426, "text": " as" }, { "start": 3426, "stop": 3442, "text": " an" }, { "start": 3442, "stop": 3464, "text": " object" }, { "start": 3503, "stop": 3521, "text": "." }, { "start": 3521, "stop": 3547, "text": " How" }, { "start": 3547, "stop": 3566, "text": " did" }, { "start": 3573, "stop": 3587, "text": " he" }, { "start": 3598, "stop": 3614, "text": " use" }, { "start": 3627, "stop": 3633, "text": " it" }, { "start": 3633, "stop": 3675, "text": " again" }, { "start": 3676, "stop": 3708, "text": "?" 
}, { "start": 3708, "stop": 3729, "text": " It" }, { "start": 3730, "stop": 3763, "text": " was" }, { "start": 3763, "stop": 3808, "text": " Ryan" }, { "start": 3808, "stop": 3836, "text": " wanted" }, { "start": 3840, "stop": 3878, "text": " Michael" }, { "start": 3878, "stop": 3896, "text": " the" }, { "start": 3896, "stop": 3952, "text": " subject" }, { "start": 3964, "stop": 3976, "text": " to" }, { "start": 3976, "stop": 4036, "text": " explain" }, { "start": 4036, "stop": 4045, "text": " the" }, { "start": 4051, "stop": 4085, "text": " computer" }, { "start": 4085, "stop": 4105, "text": " system" }, { "start": 4112, "stop": 4121, "text": "," }, { "start": 4121, "stop": 4136, "text": " the" }, { "start": 4136, "stop": 4182, "text": " object" }, { "start": 4188, "stop": 4202, "text": "," }, { "start": 4214, "stop": 4218, "text": " to" }, { "start": 4218, "stop": 4231, "text": " wh" }, { "start": 4241, "stop": 4259, "text": "ome" }, { "start": 4259, "stop": 4281, "text": "ver" }, { "start": 4289, "stop": 4300, "text": "," }, { "start": 4300, "stop": 4359, "text": " meaning" }, { "start": 4359, "stop": 4375, "text": " us" }, { "start": 4375, "stop": 4391, "text": "," }, { "start": 4391, "stop": 4424, "text": " the" }, { "start": 4424, "stop": 4503, "text": " indirect" }, { "start": 4506, "stop": 4568, "text": " object" }, { "start": 4568, "stop": 4584, "text": "," }, { "start": 4591, "stop": 4636, "text": " which" }, { "start": 4641, "stop": 4659, "text": " is" }, { "start": 4659, "stop": 4690, "text": " the" }, { "start": 4690, "stop": 4755, "text": " correct" }, { "start": 4755, "stop": 4755, "text": " usage" }, { "start": 4755, "stop": 4755, "text": " of" }, { "start": 4755, "stop": 4755, "text": " the" }, { "start": 4755, "stop": 4755, "text": " word" }, { "start": 4755, "stop": 4755, "text": "." } ] ```

Created with Vibe app.

arizhih commented 1 day ago

> or at least it looks incorrect since it counts symbols as single tokens.

That's because a token is not a word. Whisper's vocabulary has roughly 52,000 tokens, and every word is built from those tokens.

Maybe if you set max_len to 1 and enable the split_on_word option, it will produce one segment per word.

thewh1teagle commented 1 day ago

> Maybe if you set max_len to 1 and enable the split_on_word option, it will produce one segment per word.

Same

  params.set_token_timestamps(true);
  params.set_split_on_word(true);
  params.set_max_len(1);

transcript.json

```json [ { "start": 0, "stop": 14, "text": " It" }, { "start": 14, "stop": 28, "text": "'s" }, { "start": 28, "stop": 79, "text": " whoever" }, { "start": 93, "stop": 93, "text": "," }, { "start": 94, "stop": 115, "text": " not" }, { "start": 122, "stop": 129, "text": " wh" }, { "start": 129, "stop": 147, "text": "ome" }, { "start": 152, "stop": 173, "text": "ver" }, { "start": 173, "stop": 200, "text": "." }, { "start": 200, "stop": 223, "text": " That" }, { "start": 223, "stop": 233, "text": "'s" }, { "start": 234, "stop": 245, "text": " wh" }, { "start": 245, "stop": 262, "text": "ome" }, { "start": 262, "stop": 278, "text": "ver" }, { "start": 279, "stop": 298, "text": "." }, { "start": 304, "stop": 313, "text": " No" }, { "start": 313, "stop": 326, "text": " wh" }, { "start": 326, "stop": 345, "text": "ome" }, { "start": 345, "stop": 364, "text": "ver" }, { "start": 364, "stop": 365, "text": " is" }, { "start": 380, "stop": 410, "text": " never" }, { "start": 410, "stop": 463, "text": " actually" }, { "start": 463, "stop": 496, "text": " right" }, { "start": 496, "stop": 520, "text": "." }, { "start": 520, "stop": 544, "text": " Well" }, { "start": 544, "stop": 597, "text": " sometimes" }, { "start": 597, "stop": 609, "text": " it" }, { "start": 609, "stop": 615, "text": "'s" }, { "start": 623, "stop": 649, "text": " right" }, { "start": 649, "stop": 658, "text": "." }, { "start": 667, "stop": 706, "text": " Michael" }, { "start": 707, "stop": 718, "text": " is" }, { "start": 718, "stop": 741, "text": " right" }, { "start": 752, "stop": 765, "text": "." 
}, { "start": 765, "stop": 777, "text": " It" }, { "start": 777, "stop": 788, "text": "'s" }, { "start": 788, "stop": 794, "text": " a" }, { "start": 794, "stop": 818, "text": " made" }, { "start": 818, "stop": 819, "text": "-" }, { "start": 831, "stop": 834, "text": "up" }, { "start": 834, "stop": 855, "text": " word" }, { "start": 858, "stop": 879, "text": " used" }, { "start": 886, "stop": 894, "text": " to" }, { "start": 894, "stop": 931, "text": " trick" }, { "start": 936, "stop": 990, "text": " students" }, { "start": 990, "stop": 1008, "text": "." }, { "start": 1010, "stop": 1012, "text": " No" }, { "start": 1037, "stop": 1079, "text": " actually" }, { "start": 1095, "stop": 1095, "text": " wh" }, { "start": 1095, "stop": 1116, "text": "ome" }, { "start": 1132, "stop": 1137, "text": "ver" }, { "start": 1137, "stop": 1151, "text": " is" }, { "start": 1151, "stop": 1172, "text": " the" }, { "start": 1172, "stop": 1214, "text": " formal" }, { "start": 1214, "stop": 1263, "text": " version" }, { "start": 1263, "stop": 1277, "text": " of" }, { "start": 1277, "stop": 1298, "text": " the" }, { "start": 1298, "stop": 1326, "text": " word" }, { "start": 1326, "stop": 1347, "text": "." 
}, { "start": 1347, "stop": 1417, "text": " Obviously" }, { "start": 1418, "stop": 1428, "text": " it" }, { "start": 1428, "stop": 1435, "text": "'s" }, { "start": 1440, "stop": 1443, "text": " a" }, { "start": 1443, "stop": 1464, "text": " real" }, { "start": 1464, "stop": 1485, "text": " word" }, { "start": 1485, "stop": 1494, "text": "," }, { "start": 1494, "stop": 1505, "text": " but" }, { "start": 1509, "stop": 1512, "text": " I" }, { "start": 1522, "stop": 1530, "text": " don" }, { "start": 1530, "stop": 1538, "text": "'t" }, { "start": 1547, "stop": 1561, "text": " know" }, { "start": 1561, "stop": 1582, "text": " when" }, { "start": 1582, "stop": 1592, "text": " to" }, { "start": 1592, "stop": 1607, "text": " use" }, { "start": 1607, "stop": 1617, "text": " it" }, { "start": 1617, "stop": 1664, "text": " correctly" }, { "start": 1664, "stop": 1678, "text": "." }, { "start": 1678, "stop": 1694, "text": " Not" }, { "start": 1694, "stop": 1698, "text": " a" }, { "start": 1699, "stop": 1730, "text": " native" }, { "start": 1730, "stop": 1761, "text": " speaker" }, { "start": 1767, "stop": 1792, "text": "." 
}, { "start": 1792, "stop": 1798, "text": " I" }, { "start": 1800, "stop": 1823, "text": " know" }, { "start": 1823, "stop": 1848, "text": " what" }, { "start": 1848, "stop": 1860, "text": "'s" }, { "start": 1860, "stop": 1881, "text": " right" }, { "start": 1889, "stop": 1903, "text": "," }, { "start": 1904, "stop": 1910, "text": " but" }, { "start": 1923, "stop": 1927, "text": " I" }, { "start": 1927, "stop": 1939, "text": "'m" }, { "start": 1939, "stop": 1957, "text": " not" }, { "start": 1957, "stop": 1988, "text": " gonna" }, { "start": 1988, "stop": 2005, "text": " say" }, { "start": 2005, "stop": 2023, "text": " because" }, { "start": 2050, "stop": 2067, "text": " you" }, { "start": 2067, "stop": 2085, "text": "'re" }, { "start": 2085, "stop": 2103, "text": " all" }, { "start": 2103, "stop": 2120, "text": " jer" }, { "start": 2125, "stop": 2133, "text": "ks" }, { "start": 2133, "stop": 2148, "text": " who" }, { "start": 2157, "stop": 2175, "text": " didn" }, { "start": 2177, "stop": 2199, "text": "'t" }, { "start": 2206, "stop": 2218, "text": " come" }, { "start": 2218, "stop": 2231, "text": " see" }, { "start": 2231, "stop": 2240, "text": " my" }, { "start": 2240, "stop": 2258, "text": " band" }, { "start": 2258, "stop": 2276, "text": " last" }, { "start": 2276, "stop": 2293, "text": " night" }, { "start": 2301, "stop": 2312, "text": "." }, { "start": 2312, "stop": 2321, "text": " Do" }, { "start": 2321, "stop": 2334, "text": " you" }, { "start": 2334, "stop": 2361, "text": " really" }, { "start": 2361, "stop": 2379, "text": " know" }, { "start": 2379, "stop": 2402, "text": " which" }, { "start": 2402, "stop": 2411, "text": " one" }, { "start": 2417, "stop": 2424, "text": " is" }, { "start": 2424, "stop": 2456, "text": " correct" }, { "start": 2456, "stop": 2457, "text": "?" 
}, { "start": 2471, "stop": 2473, "text": " I" }, { "start": 2473, "stop": 2486, "text": " don" }, { "start": 2486, "stop": 2504, "text": "'t" }, { "start": 2504, "stop": 2507, "text": " know" }, { "start": 2524, "stop": 2540, "text": "." }, { "start": 2540, "stop": 2551, "text": " It" }, { "start": 2551, "stop": 2561, "text": "'s" }, { "start": 2574, "stop": 2584, "text": " whom" }, { "start": 2591, "stop": 2608, "text": " when" }, { "start": 2608, "stop": 2619, "text": " it" }, { "start": 2619, "stop": 2630, "text": "'s" }, { "start": 2630, "stop": 2647, "text": " the" }, { "start": 2647, "stop": 2682, "text": " object" }, { "start": 2682, "stop": 2693, "text": " of" }, { "start": 2693, "stop": 2710, "text": " the" }, { "start": 2710, "stop": 2756, "text": " sentence" }, { "start": 2756, "stop": 2773, "text": " and" }, { "start": 2773, "stop": 2790, "text": " who" }, { "start": 2790, "stop": 2813, "text": " when" }, { "start": 2813, "stop": 2824, "text": " is" }, { "start": 2824, "stop": 2841, "text": " the" }, { "start": 2841, "stop": 2879, "text": " subject" }, { "start": 2881, "stop": 2905, "text": "." }, { "start": 2917, "stop": 2942, "text": " That" }, { "start": 2942, "stop": 2964, "text": " That" }, { "start": 2964, "stop": 2993, "text": " sounds" }, { "start": 2997, "stop": 3023, "text": " right" }, { "start": 3026, "stop": 3042, "text": "." }, { "start": 3042, "stop": 3047, "text": " Well" }, { "start": 3052, "stop": 3057, "text": "," }, { "start": 3057, "stop": 3062, "text": " it" }, { "start": 3062, "stop": 3076, "text": " sounds" }, { "start": 3077, "stop": 3089, "text": " right" }, { "start": 3089, "stop": 3094, "text": "," }, { "start": 3094, "stop": 3101, "text": " but" }, { "start": 3101, "stop": 3106, "text": " is" }, { "start": 3106, "stop": 3111, "text": " it" }, { "start": 3111, "stop": 3121, "text": "?" 
}, { "start": 3122, "stop": 3137, "text": " How" }, { "start": 3137, "stop": 3152, "text": " did" }, { "start": 3152, "stop": 3171, "text": " Ryan" }, { "start": 3171, "stop": 3186, "text": " use" }, { "start": 3186, "stop": 3196, "text": " it" }, { "start": 3196, "stop": 3205, "text": "," }, { "start": 3205, "stop": 3215, "text": " as" }, { "start": 3215, "stop": 3223, "text": " an" }, { "start": 3227, "stop": 3254, "text": " object" }, { "start": 3254, "stop": 3272, "text": "?" }, { "start": 3272, "stop": 3280, "text": " As" }, { "start": 3280, "stop": 3288, "text": " an" }, { "start": 3288, "stop": 3309, "text": " object" }, { "start": 3309, "stop": 3324, "text": "." }, { "start": 3324, "stop": 3353, "text": " Ryan" }, { "start": 3353, "stop": 3382, "text": " used" }, { "start": 3382, "stop": 3396, "text": " me" }, { "start": 3396, "stop": 3410, "text": " as" }, { "start": 3410, "stop": 3424, "text": " an" }, { "start": 3424, "stop": 3466, "text": " object" }, { "start": 3494, "stop": 3494, "text": "." }, { "start": 3502, "stop": 3506, "text": " Is" }, { "start": 3506, "stop": 3516, "text": " he" }, { "start": 3520, "stop": 3549, "text": " right" }, { "start": 3549, "stop": 3580, "text": " about" }, { "start": 3580, "stop": 3605, "text": " that" }, { "start": 3605, "stop": 3609, "text": "?" }, { "start": 3627, "stop": 3640, "text": " How" }, { "start": 3640, "stop": 3654, "text": " did" }, { "start": 3654, "stop": 3663, "text": " he" }, { "start": 3663, "stop": 3677, "text": " use" }, { "start": 3677, "stop": 3686, "text": " it" }, { "start": 3686, "stop": 3709, "text": " again" }, { "start": 3709, "stop": 3726, "text": "?" }, { "start": 3726, "stop": 3735, "text": " It" }, { "start": 3735, "stop": 3749, "text": " was" }, { "start": 3749, "stop": 3775, "text": "..." 
}, { "start": 3794, "stop": 3814, "text": " Ryan" }, { "start": 3814, "stop": 3847, "text": " wanted" }, { "start": 3847, "stop": 3885, "text": " Michael" }, { "start": 3885, "stop": 3897, "text": "," }, { "start": 3897, "stop": 3914, "text": " the" }, { "start": 3914, "stop": 3952, "text": " subject" }, { "start": 3952, "stop": 3960, "text": "," }, { "start": 3964, "stop": 3975, "text": " to" }, { "start": 3975, "stop": 4014, "text": " explain" }, { "start": 4014, "stop": 4031, "text": " the" }, { "start": 4031, "stop": 4076, "text": " computer" }, { "start": 4076, "stop": 4105, "text": " system" }, { "start": 4109, "stop": 4120, "text": "," }, { "start": 4120, "stop": 4137, "text": " the" }, { "start": 4137, "stop": 4170, "text": " object" }, { "start": 4170, "stop": 4194, "text": "." }, { "start": 4214, "stop": 4227, "text": " Thank" }, { "start": 4227, "stop": 4242, "text": " you" }, { "start": 4247, "stop": 4265, "text": "." }, { "start": 4265, "stop": 4278, "text": " To" }, { "start": 4278, "stop": 4291, "text": " wh" }, { "start": 4291, "stop": 4310, "text": "ome" }, { "start": 4310, "stop": 4329, "text": "ver" }, { "start": 4329, "stop": 4340, "text": "," }, { "start": 4358, "stop": 4388, "text": " meaning" }, { "start": 4388, "stop": 4401, "text": " us" }, { "start": 4401, "stop": 4411, "text": "," }, { "start": 4418, "stop": 4429, "text": " the" }, { "start": 4433, "stop": 4486, "text": " indirect" }, { "start": 4486, "stop": 4524, "text": " object" }, { "start": 4525, "stop": 4546, "text": "," }, { "start": 4549, "stop": 4573, "text": " which" }, { "start": 4573, "stop": 4584, "text": " is" }, { "start": 4584, "stop": 4600, "text": " the" }, { "start": 4600, "stop": 4636, "text": " correct" }, { "start": 4641, "stop": 4661, "text": " usage" }, { "start": 4668, "stop": 4677, "text": " of" }, { "start": 4677, "stop": 4693, "text": " the" }, { "start": 4693, "stop": 4715, "text": " word" }, { "start": 4715, "stop": 4736, "text": "." } ] ```

Maybe I have a mistake in how I consume the segments. This is how I create the word segments:

core/src/model.rs#L134

arizhih commented 1 day ago

> Maybe I have a mistake in how I consume the segments. This is how I create the word segments:
>
> core/src/model.rs#L134

Yes, you are reading tokens, but you need to read the segment text. Try this: let text = state.full_get_segment_text_lossy(s).context("failed to get segment")?;

thewh1teagle commented 1 day ago

> Yes, you are reading tokens, but you need to read the segment text. Try this: let text = state.full_get_segment_text_lossy(s).context("failed to get segment")?;

Notice that I said word segments. In general I already use get_segment_text there, in the else branch. Do I need to use get_segment_text even inside the loop over num_tokens?

arizhih commented 1 day ago

My proposal was to use max_len 1 and split_on_word; I think that with these options each segment will be a single word.

So you don’t need to use tokens at all, only segments.

for s in 0..num_segments {
    let text = state.full_get_segment_text_lossy(s).context("failed to get segment")?;
    let start = state.full_get_segment_t0(s).context("failed to get start timestamp")?;
    let stop = state.full_get_segment_t1(s).context("failed to get end timestamp")?;
    segments.push(Segment { text, start, stop });
}

If this doesn’t help, tomorrow I’ll give you an example of how to create words from tokens.
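For reference, building words from token-level output by hand could look roughly like the sketch below. It is an illustration only, not the example promised above. It assumes that a token whose text starts with a space begins a new word, which matches the English sample in this thread but will not hold for languages written without spaces. The Segment field types and the merge_tokens_into_words name are assumptions.

```rust
/// Same shape as the entries in the JSON output above; field types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub text: String,
    pub start: i64,
    pub stop: i64,
}

/// Glue token-level entries together into word-level entries.
/// A token without a leading space is treated as a continuation of the
/// previous word (sub-word pieces, apostrophe suffixes, punctuation).
pub fn merge_tokens_into_words(tokens: &[Segment]) -> Vec<Segment> {
    let mut words: Vec<Segment> = Vec::new();
    for tok in tokens {
        let starts_word = tok.text.starts_with(' ');
        if !starts_word {
            if let Some(word) = words.last_mut() {
                // Extend the current word and move its end time forward.
                word.text.push_str(&tok.text);
                word.stop = tok.stop;
                continue;
            }
        }
        // Either a new word, or the very first token of the input.
        words.push(tok.clone());
    }
    words
}
```

Applied to the first tokens of the output above, " It" and "'s" become " It's" (0 to 28), and " wh", "ome", "ver", "." become " whomever." (122 to 200).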

thewh1teagle commented 1 day ago

@arizhih

It worked! I tried so many options there but didn't think of this one.

Thank you so much :)
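Once every segment is a single word, the original goal of a maximum number of words per caption becomes a grouping step in application code. The following is a minimal sketch under stated assumptions: Segment mirrors the struct used earlier in the thread with assumed field types, group_words is a hypothetical helper, and closing a caption early on ".", "?" or "!" is a design choice made here, not something whisper.cpp does.

```rust
/// One-word segments as returned with max_len = 1 and split_on_word;
/// field types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub text: String,
    pub start: i64,
    pub stop: i64,
}

/// Join consecutive words into one caption spanning first start to last stop.
/// Callers must pass a non-empty slice.
fn join(words: &[&Segment]) -> Segment {
    Segment {
        text: words.iter().map(|w| w.text.as_str()).collect::<String>().trim().to_string(),
        start: words[0].start,
        stop: words[words.len() - 1].stop,
    }
}

/// Group words into captions of at most `max_words` words.
/// A caption is also closed when a word ends a sentence, so captions
/// do not run across sentence boundaries.
pub fn group_words(words: &[Segment], max_words: usize) -> Vec<Segment> {
    let max_words = max_words.max(1); // treat 0 as 1 to avoid empty captions
    let mut captions = Vec::new();
    let mut current: Vec<&Segment> = Vec::new();
    for word in words {
        current.push(word);
        let ends_sentence = matches!(word.text.chars().last(), Some('.' | '?' | '!'));
        if current.len() >= max_words || ends_sentence {
            captions.push(join(&current));
            current.clear();
        }
    }
    if !current.is_empty() {
        captions.push(join(&current));
    }
    captions
}
```

With max_words set to 3, the words " It's", " whoever,", " not", " whomever." yield the captions "It's whoever, not" and "whomever.", each carrying the start of its first word and the stop of its last.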

tazz4843 / whisper-rs

max tokens and split on word params doesn't work #156