jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

segments is empty when using word_timestamps=False #100

Closed flesnuk closed 1 year ago

flesnuk commented 1 year ago

I don't want the word timing functionality, so I pass word_timestamps=False to the transcribe function.

When I try to save the results with result.to_srt_vtt, I get an empty file, so I have to resort to using the Whisper writer instead:

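from whisper.utils import get_writer

writer = get_writer("srt", args.save_dir)
result = model.transcribe ...
# Pass the Whisper-style dict stored under 'ori_dict' to the Whisper writer
writer(result.to_dict().get('ori_dict'), "file.srt")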
moriseika commented 1 year ago

Setting segment_level=True and word_level=False in the to_srt_vtt arguments, instead of passing word_timestamps=False to transcribe, allowed SRT generation at the segment level only:

result.to_srt_vtt(output_path, segment_level=True, word_level=False)
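For reference, segment-level SRT output amounts to one numbered entry per segment, with no per-word timing involved. Below is a minimal sketch of that formatting in plain Python. It assumes each segment is a dict with "start", "end" (in seconds) and "text" keys, as in a Whisper-style result; the helper names srt_timestamp and segments_to_srt are hypothetical and are not part of stable-ts or Whisper.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render segment dicts (assumed keys: start, end, text) as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        text = seg["text"].strip()
        # One SRT entry: counter, time range, then the segment text
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # Entries are separated by a blank line
    return "\n".join(blocks)
```

Something like this could serve as a fallback for writing segments by hand when neither to_srt_vtt nor the Whisper writer is convenient.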
flesnuk commented 1 year ago

Yes, but that makes the script use timing.py from whisper for word timing processing, which oddly increases VRAM usage with a finetuned model (it exceeds 8 GB of VRAM with a finetuned medium model but works with the standard medium model). The OOM error is raised here: https://github.com/openai/whisper/blob/main/whisper/timing.py#L49

That's why I use the word_timestamps=False option for now.

jianfch commented 1 year ago

A temporary quick fix is to use regroup=False, because there appears to be a bug in the regrouping logic.

result = model.transcribe('audio.mp3', regroup=False)
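Putting the pieces of this thread together, the temporary workaround might look like the sketch below. It only combines parameters already mentioned above (word_timestamps, regroup, segment_level, word_level); the wrapper name transcribe_to_segment_srt is hypothetical, and whether the explicit to_srt_vtt arguments are still needed alongside regroup=False is an assumption.

```python
def transcribe_to_segment_srt(model, audio_path, srt_path):
    """Transcribe without word timestamps and save a segment-level SRT.

    `model` is assumed to be a stable-ts model whose transcribe() accepts
    word_timestamps and regroup, as discussed in this thread.
    """
    # regroup=False skips the regrouping step suspected of emptying the segments
    result = model.transcribe(audio_path, word_timestamps=False, regroup=False)
    # Write segment-level entries only, without per-word timing
    result.to_srt_vtt(srt_path, segment_level=True, word_level=False)
    return result
```

With stable-ts installed, this could be called as transcribe_to_segment_srt(stable_whisper.load_model('medium'), 'audio.mp3', 'audio.srt').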
jianfch commented 1 year ago

Should be fixed in the latest commit.

flesnuk commented 1 year ago

Thanks, it works with the latest commit.