jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Stable auto-detecting English after 30+ seconds of silence even if language is specified in the prompt. #220

Closed: Adriann25 closed this issue 1 year ago

Adriann25 commented 1 year ago

Hello, everyone. Recently, I got stumped by this seemingly small thing.

I wanted to have a German video transcribed in its original language, but noticed that stable-ts was auto-detecting the language as English. More specifically, it translates everything rather than transcribing it. Whisper itself seems to detect the language correctly somehow.

I played around with it a bit: specifying the task, switching between "--language de" and "--language German", and changing the model from v2 to v1, all with no success. Then I noticed the cause is the silent audio at the start, which runs for more than 30 seconds. As a test, I cut the first few lines of dialogue out of the video, and sure enough the clip was detected as German. My command was simply:

!stable-ts "Test.mp4" -o "Test.srt" --task transcribe --language German --model large-v2 --word_level=False

Is there something that makes stable-ts force language auto-detection, even when a language is specified, if there is no speech in the first 30 seconds? Or am I missing something here?

jianfch commented 1 year ago

It should be fixed in 9aba3dc6655743e051479c479809020fdadf95fe. To get it to work before the fix, use --transcribe_option "language=German".
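
For anyone scripting this, here is a minimal sketch of how the workaround could be assembled in Python. The helper name build_stable_ts_command and its parameters are hypothetical, made up for illustration. The flags are only the ones quoted in this thread. Passing the language through --transcribe_option in place of --language is an assumption about how the workaround is meant to be used.

```python
import shlex


def build_stable_ts_command(media, output, language,
                            model="large-v2", use_workaround=True):
    """Build the stable-ts argument list for a plain transcription.

    Hypothetical helper: it only assembles the flags quoted in this
    thread and does not run anything.
    """
    cmd = [
        "stable-ts", media,
        "-o", output,
        "--task", "transcribe",
        "--model", model,
        "--word_level=False",
    ]
    if use_workaround:
        # Pre-fix workaround from the maintainer's reply: pass the language
        # as a transcribe option (assumed here to replace --language).
        cmd += ["--transcribe_option", f"language={language}"]
    else:
        # After the fix, the regular flag should be enough.
        cmd += ["--language", language]
    return cmd


cmd = build_stable_ts_command("Test.mp4", "Test.srt", "German")
# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(cmd))
```

The list form can be handed to subprocess.run(cmd, check=True) as-is, which avoids quoting problems with file names that contain spaces.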

Adriann25 commented 1 year ago

Thank you for the prompt response. Works perfectly again.