Timestamps - Githubissues

I'm doing some experimenting with this crate. If I use the following params:


params.set_print_special(false);
params.set_print_progress(false);
params.set_print_realtime(false);
params.set_print_timestamps(false);
params.set_token_timestamps(true);
params.set_max_len(1);

And attempt to get the individual token timestamps:


let num_segments = state.full_n_segments().expect("Failed to get segments.");

for i in 0..num_segments {
    /* extract words */
    let token_count = state
        .full_n_tokens(i)
        .expect("Unable to get token count from segments.");
    for z in 0..token_count {
        let item = state
            .full_get_token_data(i, z)
            .expect("Unable to get full token data");
        let token_id = item.id;
        let start_time = item.t0 * 1_000_000;
        let end_time = item.t1 * 1_000_000;
        let probability = item.p;

        let word = state
            .full_get_token_text(i, z)
            .expect("Unable to get token text");

        if RE.is_match(word.trim()) {
            continue;
        }
        let pts_start_time = vad_chunk.start_pts + start_time as u64;
        let pts_end_time = pts_start_time + (end_time as u64 - start_time as u64);

        recognized_words.push(RecognizedWord {
            media_id: media_id.clone(),
            start_time: pts_start_time,
            end_time: pts_end_time,
            word,
            confidence: probability,
        });
    }
}

Here are some issues I've found:

1) A VAD algorithm is generally required to determine (approximate) the start position, because whisper.cpp typically returns 0 for the t0 of the first word, regardless of where speech actually starts in the audio segment (especially after periods of no audio).

2) From a quick look at the whisper.cpp source, t0 and t1 appear to be in milliseconds (my code uses nanoseconds), but it routinely reports distances between start_time and end_time that are nonsensical, like 72 for the word "Paul"; it's not practical for a human to speak that word in that timeframe. I thought maybe this required some conversion step, but I can't make sense of what that might be. As a result, adding the voice_activity_start position to the t0 and t1 values does not seem to align things consistently.

Is there a special combination of params that allows for pulling (semi-)accurate timestamps? Is there another way to approach this? This also happens when I try to get the segment timestamp(s). I'm trying to come up with a solution that allows, at the very least, accurate detection of the start position of each word. If necessary, I could then detect the end position and the start of the next word segment to force alignment using the original audio slice.

I dug through the source code and found this function for converting the t0 and t1 to a human-readable timestamp:

//  500 -> 00:05.000
// 6000 -> 01:00.000
static std::string to_timestamp(int64_t t, bool comma = false) {
    int64_t msec = t * 10;
    int64_t hr = msec / (1000 * 60 * 60);
    msec = msec - hr * (1000 * 60 * 60);
    int64_t min = msec / (1000 * 60);
    msec = msec - min * (1000 * 60);
    int64_t sec = msec / 1000;
    msec = msec - sec * 1000;

    char buf[32];
    snprintf(buf, sizeof(buf), "%02d:%02d:%02d%s%03d", (int) hr, (int) min, (int) sec, comma ? "," : ".", (int) msec);

    return std::string(buf);
}

Not sure if you also found the above, but either way it might be useful for verifying some of your code.
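For checking values on the Rust side, here is a rough port of the function above. It is only a sketch that mirrors the C++ arithmetic line by line; it is not part of whisper-rs, and the function name is simply carried over from the C++ source.

```rust
/// Formats a whisper.cpp timestamp as HH:MM:SS.mmm.
/// `t` is in units of 10 ms, matching the `t * 10` conversion in the C++ code.
fn to_timestamp(t: i64, comma: bool) -> String {
    let mut msec = t * 10;
    let hr = msec / (1000 * 60 * 60);
    msec -= hr * (1000 * 60 * 60);
    let min = msec / (1000 * 60);
    msec -= min * (1000 * 60);
    let sec = msec / 1000;
    msec -= sec * 1000;

    // SRT-style output uses a comma before the milliseconds, otherwise a dot.
    let sep = if comma { "," } else { "." };
    format!("{:02}:{:02}:{:02}{}{:03}", hr, min, sec, sep, msec)
}
```

Calling to_timestamp(72, false) gives "00:00:00.720", so a raw value of 72 corresponds to 720 ms, not 72 ms.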

It's generally required to use a VAD algo to determine (approximate) the start position, as whisper.cpp typically returns 0 for the t0 of the first word, regardless of the actual position detected in the audio segment (especially after periods of no audio).

I experience this with the ./main binary from the upstream whisper.cpp repo, so I don't think this is a whisper-rs issue anyways.

This:

int64_t msec = t * 10;

That was part of my problem with the nonsensical values. I did actually find that block in the source, but I apparently didn't pay attention to the fact that t is in units of 10 ms (centiseconds), not milliseconds, when I looked at it.
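Given the `t * 10` conversion, the nanosecond math in the original loop would need a factor of 10_000_000 instead of 1_000_000. A minimal sketch of that conversion follows; the helper name, the constant name, and the chunk_start_pts_ns parameter are made up for illustration, and clamping negative values to 0 is an assumption rather than documented whisper.cpp behaviour.

```rust
/// One whisper.cpp timestamp unit is 10 ms, i.e. 10_000_000 ns.
const NANOS_PER_WHISPER_UNIT: u64 = 10_000_000;

/// Converts a token's (t0, t1) into absolute nanosecond timestamps,
/// offset by the presentation timestamp at which the audio chunk starts.
/// `chunk_start_pts_ns` is a hypothetical parameter for this sketch.
fn token_pts_ns(t0: i64, t1: i64, chunk_start_pts_ns: u64) -> (u64, u64) {
    // Assumption: treat negative (unset) timestamps as 0 and never let
    // the end fall before the start.
    let start = t0.max(0) as u64;
    let end = t1.max(t0).max(0) as u64;
    (
        chunk_start_pts_ns + start * NANOS_PER_WHISPER_UNIT,
        chunk_start_pts_ns + end * NANOS_PER_WHISPER_UNIT,
    )
}
```

With t0 = 0 and t1 = 72 on a chunk starting at 1 s, this yields a word that spans 1_000_000_000 ns to 1_720_000_000 ns, which is a plausible 720 ms.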

Combined with something like this:

let chunk_duration_ms = 10;
// 16 samples per millisecond (assumes 16 kHz audio)
let chunk_size = (chunk_duration_ms as f32 * 16.0) as usize;

// Skip leading chunks until the VAD reports voice
while slice_start + chunk_size < slice_end
    && !vad
        .is_voice_segment(&vad_chunk.slice[slice_start..(slice_start + chunk_size)])
        .unwrap_or(false)
{
    slice_start += chunk_size;
}

// Convert the sample offset to milliseconds, then to nanoseconds
let slice_start_ms = ((slice_start as f32 / chunk_size as f32)
    * chunk_duration_ms as f32)
    .round() as u64;
let first_voice_activity = slice_start_ms * 1_000_000;

To fish out the offset of the buffer being processed, the words seem to be lining up fairly correctly now (with a few somewhat minor alignment issues with the presentation timestamp).

I'll go ahead and close this issue given your comments about the upstream issues with the main program. Thank you for pasting that block of code and for the assistance with this crate. I appreciate it!
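For anyone who wants to try the same idea without the surrounding code, here is a self-contained sketch of the leading-silence scan. It assumes 16 kHz mono i16 samples. The function names are hypothetical, and loud_enough is a crude energy threshold standing in for a real VAD, with an arbitrary threshold of 500.

```rust
/// Samples per millisecond, assuming 16 kHz mono audio.
const SAMPLES_PER_MS: usize = 16;

/// Scans `samples` in frames of `frame_ms` milliseconds and returns the
/// offset, in nanoseconds, of the first frame that `is_voice` accepts.
/// If no frame is accepted, the offset of the last frame start reached
/// is returned.
fn first_voice_offset_ns<F>(samples: &[i16], frame_ms: usize, mut is_voice: F) -> u64
where
    F: FnMut(&[i16]) -> bool,
{
    let frame_len = frame_ms * SAMPLES_PER_MS;
    if frame_len == 0 {
        return 0; // avoid looping forever on a zero-length frame
    }

    let mut start = 0;
    while start + frame_len < samples.len() && !is_voice(&samples[start..start + frame_len]) {
        start += frame_len;
    }

    // Sample offset -> milliseconds -> nanoseconds.
    let offset_ms = (start / SAMPLES_PER_MS) as u64;
    offset_ms * 1_000_000
}

/// Hypothetical stand-in for a real VAD: mean absolute amplitude above 500.
fn loud_enough(frame: &[i16]) -> bool {
    let total: i64 = frame.iter().map(|s| (*s as i64).abs()).sum();
    total / (frame.len().max(1) as i64) > 500
}
```

The returned offset would be added to the chunk's starting presentation timestamp before the per-token offsets are applied. In real use, the closure would wrap whatever VAD is already in place instead of loud_enough.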

tazz4843 / whisper-rs

Timestamps #71