jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Problems with the latest version #173

Closed fablau closed 1 year ago

fablau commented 1 year ago

Hello. I just installed the latest version from the repo, and now I am getting incorrect transcriptions.

Here is the video I have tried to transcribe:

https://youtu.be/aFJCahnRF9s

It gets transcribed perfectly with regular Whisper, but with stable-ts it skips some words.

Here is the code I have been using so far, and it always worked before:

import stable_whisper

model = stable_whisper.load_model('medium', download_root=rootDir)

results = model.transcribe(audioFile, language="English")

results.to_srt_vtt("subs.srt", word_level=False, strip=True)

I am pasting below the first 13 subs, which are clearly incorrect (many words are missing):

0
00:00:00,060 --> 00:00:04,860
Piano is more a quality of tone than an absolute volume.

1
00:00:05,580 --> 00:00:09,400
Not only that, but you must project out into the room.

2
00:00:09,720 --> 00:00:12,140
Remember you're not just playing the piano,

3
00:00:12,400 --> 00:00:14,720
you're playing the room you're in.

4
00:00:25,020 --> 00:00:28,140
Hi, I'm Robert Estrin, this is LivingPianos.com.

5
00:00:28,640 --> 00:00:31,060
that piano?

6
00:00:38,240 --> 00:00:48,220
How soft is soft and how do you even achieve it on the piano?

7
00:00:48,580 --> 00:00:53,620
We're going to dive right into this today and cover this in a way that may make sense to you.

8
00:00:54,460 --> 00:00:57,760
You know that there are something called a decibel meter.

9
00:00:58,460 --> 00:01:00,140
It measures the unit of volume

10
00:01:00,800 --> 00:01:02,940
and you might like to have an answer.

11
00:01:03,300 --> 00:01:05,060
Sometimes people see Allegro in

12
00:01:05,060 --> 00:01:05,980
their score and they go,

13
00:01:06,360 --> 00:01:07,680
how fast is Allegro?

Any ideas?

jianfch commented 1 year ago

It seems like this is one of those cases where suppressing the timestamp tokens throws the model off. Use suppress_ts_tokens=False to disable it.

results = model.transcribe(..., suppress_ts_tokens=False)
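
For example, applied to the snippet from the first post (same rootDir and audioFile as before), the call would become something like:

results = model.transcribe(audioFile, language="English", suppress_ts_tokens=False)

results.to_srt_vtt("subs.srt", word_level=False, strip=True)
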
fablau commented 1 year ago

Perfect, problem solved! Thank you!

What are the use cases where you suggest setting it to True?

jianfch commented 1 year ago

Empirically, I found that suppress_ts_tokens=True tends to do better with vad=True. This might be because the VAD is more accurate at detecting speech than the default non-VAD method, which can suppress so many of the "good" timestamps that the model ends up picking ones that make it skip words and hallucinate. suppress_ts_tokens=True made more sense as the default in version 1.x, but it probably should not have carried over to 2.x because it offers little benefit there. The timestamps produced by the token suppression are discarded anyway when word_timestamps=True.
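
As a rough sketch, reusing model and audioFile from the earlier snippet, the combination described above would look something like:

# keep the token suppression, but pair it with VAD-based speech detection
results = model.transcribe(audioFile, language="English", vad=True, suppress_ts_tokens=True)

# note: with word_timestamps=True (the default), the timestamps produced by the
# token suppression are discarded anyway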

fablau commented 1 year ago

Got it. Thanks!