m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

OG whisper word level timestamps support #286

Open Saccarab opened 1 year ago

Saccarab commented 1 year ago

Is it possible to use the original (OpenAI) Whisper word-level timestamps and skip forced alignment?

m-bain commented 1 year ago

You will likely need v2 for that https://github.com/m-bain/whisperX/issues/232#issuecomment-1546460436

The batched inference does not yet support word-level timestamps.

Saccarab commented 1 year ago

Would it be possible, in theory, to enable word-level timestamps through faster_whisper and patch them into the segments?

stri8ed commented 7 months ago

> Would it be possible, in theory, to enable word-level timestamps through faster_whisper and patch them into the segments?

Yes. The timestamp tokens are being filtered out during decoding. You can remove the filtering, and then process them as needed.

E.g.

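# Meant to run inside the batched decoding code, with the timestamp filter
# removed, where `tokens` are the token ids decoded for VAD segment `idx`.
# Assumed initial state (not shown in the original snippet):
segments = []
token_buffer = []   # text tokens seen since the last opening timestamp
start_time = None   # set once an opening timestamp token has been seen

for token in tokens:
    if token >= self.tokenizer.timestamp_begin:
        # Timestamp tokens count 0.02 s steps from the start of the chunk.
        timestamp_position = token - self.tokenizer.timestamp_begin
        ts_time = round(
            vad_segments[idx]['start'] + timestamp_position * 0.02, 3
        )
        if start_time is None:
            # Opening timestamp: the text that follows starts here.
            start_time = ts_time
        else:
            # Closing timestamp: emit the buffered text as one segment.
            segments.append(
                {
                    "text": self.tokenizer.decode(token_buffer),
                    "start": start_time,
                    "end": ts_time
                }
            )
            token_buffer = []
            start_time = None
    else:
        # Ordinary text token: keep it until the closing timestamp arrives.
        token_buffer.append(token)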
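
For anyone who wants to try the idea outside of whisperX, here is a minimal standalone sketch of the same loop. It is not the whisperX or faster_whisper API: the function name `timestamp_tokens_to_segments`, the toy vocabulary, and the value used for `TIMESTAMP_BEGIN` are all made up for illustration. In real code you would pass the tokenizer's `timestamp_begin` and `decode`, and the start time of the VAD chunk.

```python
from typing import Callable, Dict, List, Sequence, Union

TIME_PRECISION = 0.02  # seconds per timestamp-token step, as in the snippet above


def timestamp_tokens_to_segments(
    tokens: Sequence[int],
    timestamp_begin: int,
    decode: Callable[[List[int]], str],
    chunk_start: float = 0.0,
) -> List[Dict[str, Union[str, float]]]:
    """Pair up timestamp tokens and return the text found between each pair."""
    segments: List[Dict[str, Union[str, float]]] = []
    token_buffer: List[int] = []
    start_time = None

    for token in tokens:
        if token < timestamp_begin:
            # Ordinary text token: buffer it until a closing timestamp arrives.
            token_buffer.append(token)
            continue

        # Timestamp token: convert its index to seconds, offset by the chunk start.
        ts_time = round(chunk_start + (token - timestamp_begin) * TIME_PRECISION, 3)
        if start_time is None:
            start_time = ts_time
        else:
            segments.append(
                {"text": decode(token_buffer), "start": start_time, "end": ts_time}
            )
            token_buffer = []
            start_time = None

    return segments


# Hypothetical toy tokenizer, only so the sketch runs without a model.
TIMESTAMP_BEGIN = 100
VOCAB = {0: "Hello", 1: " world", 2: " again"}


def toy_decode(ids: List[int]) -> str:
    return "".join(VOCAB[i] for i in ids)


if __name__ == "__main__":
    # <|0.00|> Hello world <|1.00|><|1.00|> again <|1.50|>, chunk starting at 12.5 s
    toy_tokens = [100, 0, 1, 150, 150, 2, 175]
    for seg in timestamp_tokens_to_segments(
        toy_tokens, TIMESTAMP_BEGIN, toy_decode, chunk_start=12.5
    ):
        print(seg)
```

Keep in mind that the spans you get are only as fine as the timestamp tokens the model emits, at 0.02 s resolution, so they may cover several words each.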