jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Transcription quality #404

Open · qo4on opened 1 month ago

qo4on commented 1 month ago

Why is the quality of stable-ts transcription much worse than that of openai/whisper? New lines of text are added where they should not be, and numbers like 0.003 and 0.05 come out as 0 0 3 and 0 0 5...

# openai/whisper CLI
whisper audio.mp3 --model turbo

# stable-ts
import stable_whisper
model = stable_whisper.load_model('turbo')
result = model.transcribe('audio.mp3')
result.to_txt('audio.txt')
jianfch commented 1 month ago

This looks like it is caused by the default regrouping. Try disabling it with model.transcribe(..., regroup=False), or use a custom regrouping that handles numbers better. If this does not resolve the issue, can you share an audio clip that reproduces it?
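
For example, something along these lines (a rough sketch; the punctuation set and gap threshold are only placeholders you would need to tune for your audio):

import stable_whisper

model = stable_whisper.load_model('turbo')
# transcribe without the default regrouping, then apply a custom chain
result = model.transcribe('audio.mp3', regroup=False)
# split only at sentence-ending punctuation followed by a space,
# so the decimal point inside numbers like 0.003 is left alone,
# then also split at silences longer than half a second
result.split_by_punctuation([('.', ' '), '?', '!']).split_by_gap(0.5)
result.to_txt('audio.txt')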

qo4on commented 1 month ago

regroup=False

This makes the subtitles huge, about 500 characters long each. In text files, it did help get rid of a lot of incorrectly added newlines, but not all of them.

Unfortunately, I can't publish this particular audio file. But from what I have noticed, stable-ts transcription quality is much worse than openai/whisper specifically for Russian audio, and the difference shows up on almost any Russian recording. You can verify this even without knowing Russian: just ask an LLM such as ChatGPT or Gemini to compare the transcription quality of the stable-ts output against the official openai output. They give a reasonable comparison.

jianfch commented 1 month ago

Try model.transcribe_minimal(..., regroup=False). This will run official Whisper's original transcription function and keep the output text the same, while only making minor adjustments to the timestamps.
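
A minimal sketch of that call, reusing the same file names as above:

import stable_whisper

model = stable_whisper.load_model('turbo')
# same text output as the official Whisper; only the timestamps are refined
result = model.transcribe_minimal('audio.mp3', regroup=False)
result.to_txt('audio.txt')
# the refined timestamps can still be exported, e.g. as subtitles
result.to_srt_vtt('audio.srt')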

qo4on commented 1 month ago

Thank you.