m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Word-level timestamps not working with python implementation #910

Open rkulyassa opened 1 month ago

I am trying to get word-level timestamps from whisperx in Python. Despite passing asr_options={"word_timestamps": True}, the output has the form {'segments': [ ... ], 'language': 'en'}, with no word_segments key.

I dug around a bit but could not find the cause. I have confirmed that model.options.word_timestamps is True, so I suspect the problem is inside model.transcribe; perhaps the options are not being passed through to faster-whisper correctly.

My code:

import whisperx

# model_name, device, compute_type and video_path are defined elsewhere in my script
model = whisperx.load_model(
    model_name,
    device=device,
    compute_type=compute_type,
    language="en",
    task="transcribe",
    asr_options={"word_timestamps": True},
)
print(model.options.word_timestamps)  # True
# The returned dict has 'segments' and 'language' but no word-level timestamps
transcript = model.transcribe(video_path, language="en")

It should be noted that running via command line works fine:

whisperx \
    --model large-v2 \
    --compute_type int8 \
    --output_format json \
    --suppress_numerals \
    --task transcribe \
    --language en \
    "$input_file"

This correctly includes word_segments in the JSON output.
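For comparison, the Python usage shown in the whisperX README gets word-level timestamps from a separate alignment step (whisperx.load_align_model followed by whisperx.align) rather than from an ASR option. As far as I can tell, the CLI runs that step by default, which may explain the difference. Below is a minimal sketch of that pipeline, not a confirmed fix. The helper names has_word_timestamps and transcribe_with_word_timestamps, the default arguments, and batch_size=16 are my own illustrative assumptions; only the whisperx calls themselves are taken from the README.

```python
def has_word_timestamps(result):
    """Return True if a result dict carries word-level timing.

    Hypothetical helper: checks for a non-empty top-level 'word_segments'
    list or a 'words' key on any segment.
    """
    if result.get("word_segments"):
        return True
    return any("words" in segment for segment in result.get("segments", []))


def transcribe_with_word_timestamps(audio_path, model_name="large-v2",
                                    device="cpu", compute_type="int8"):
    """Transcribe, then align, following the README's Python usage.

    Hypothetical wrapper; the defaults here are assumptions, not
    recommendations.
    """
    # Imported lazily so the helper above can be used without whisperx installed.
    import whisperx

    model = whisperx.load_model(model_name, device, compute_type=compute_type)
    audio = whisperx.load_audio(audio_path)

    # Step 1: segment-level transcription (no word timings yet).
    result = model.transcribe(audio, batch_size=16, language="en")

    # Step 2: forced alignment, which adds per-word start/end times.
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    aligned = whisperx.align(
        result["segments"], align_model, metadata, audio, device,
        return_char_alignments=False,
    )
    return aligned
```

Calling has_word_timestamps on the dict returned by model.transcribe and again on the dict returned by whisperx.align should show whether the alignment step is what adds the missing word_segments.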