jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Very unstable results. Everytime I run model.transcribe, I get different results. #199

Closed furqan4545 closed 1 year ago

furqan4545 commented 1 year ago

result2 = model.transcribe('tate_pier.mp3', mel_first=True, demucs=True)
result2 = model.transcribe('tate_pier.mp3', mel_first=True)

I used the different settings shown above, both with and without VAD. The accuracy is not as good as original Whisper: it is missing a lot of words at times, even though I am using the large-v2 model. Could you please tell me if there is a specific parameter I can use so that it doesn't miss words? Also, every time I run it I get different results; it is not very consistent, and I don't know how to make it consistent. Sometimes the results are very good and sometimes really bad. Your help will be highly appreciated.

jianfch commented 1 year ago

Run this test to see if you have the required version of Whisper:

import importlib.metadata
import whisper
import warnings

_required_whisper_ver = list(
    filter(lambda x: x.startswith('openai-whisper'), importlib.metadata.distribution('stable-ts').requires)
)[0].split('==')[-1]

if (
        whisper.__version__ != _required_whisper_ver or  # check version
        importlib.metadata.distribution('openai-whisper').read_text('direct_url.json')  # check if installed from repo
):
    warnings.warn('The installed version of Whisper might be incompatible.\n'
                  'To prevent errors and performance issues, reinstall correct version with: '
                  f'"pip install --upgrade --no-deps --force-reinstall openai-whisper=={_required_whisper_ver}".')
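For illustration, the version-string parsing above can be exercised on its own with a hypothetical requirements list (the entries below are made up to mirror the strings `importlib.metadata.distribution('stable-ts').requires` returns):

```python
# Hypothetical requirement strings, shaped like the output of
# importlib.metadata.distribution('stable-ts').requires
requires = ['torch', 'torchaudio', 'openai-whisper==20230314', 'tqdm']

# Same extraction as above: keep the openai-whisper entry,
# then split off the pinned version after '=='
required_whisper_ver = list(
    filter(lambda x: x.startswith('openai-whisper'), requires)
)[0].split('==')[-1]

print(required_whisper_ver)  # 20230314
```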

It would help if you could share an audio file you're seeing this issue with and its results as JSON files (preferably more than one JSON with different results).

furqan4545 commented 1 year ago

My Whisper version is 20230314.

this is the video I'm transcribing. https://www.youtube.com/watch?v=dFxsi5GUQ5c&t=8s

here is link to my colab notebook.

https://colab.research.google.com/drive/1AW9oS1NxPe_wpMaLskIOCkL6nSfitjqj?usp=sharing

jianfch commented 1 year ago

You can use transcribe_minimal() to transcribe. It should give results similar to Whisper because it uses Whisper's default transcribe(). Note that demucs=True increases the chances of getting different results each time, because Demucs generates a slightly different audio output on each run unless you specify the same seed before each run.

furqan4545 commented 1 year ago

Hi, I tried transcribe_minimal() with different parameters as well, but I don't think it is original Whisper, because it works exactly the same as transcribe() from stable whisper. I think you need to double check when you get time, brother. Secondly, how can I pass a fixed seed into Demucs as you mentioned above? I explored the code but there is no such parameter. Can you tell me where I can pass a seed to keep the output the same?

jianfch commented 1 year ago

The differences you see are likely due to the options you used. If you disable the pre and post processing for transcribe_minimal(), the results should match original Whisper's results because it calls whisper.transcribe() directly.

import whisper, stable_whisper
model = whisper.load_model('base')
res_original = model.transcribe('audio.mp3', word_timestamps=True, verbose=False)
res_original = stable_whisper.WhisperResult(res_original, force_order=True)

stable_whisper.modify_model(model)
# [demucs] and [only_voice_freq] are False by default, so preprocessing is disabled by default
# [regroup=False] and [suppress_silence=False] disable postprocessing
res_stable = model.transcribe_minimal('audio.mp3', regroup=False, suppress_silence=False)

assert res_original.to_srt_vtt() == res_stable.to_srt_vtt()

Specify the seed before each run with:

import random
random.seed(0)

# test if seed works
import torch
from stable_whisper.audio import demucs_audio
random.seed(0)
vocal0 = demucs_audio('audio.mp3')
random.seed(0)
vocal1 = demucs_audio('audio.mp3')
assert torch.isclose(vocal0, vocal1).all()
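The seed trick works because Python's global RNG replays the identical sequence for the same seed; here is a minimal stdlib-only sketch of that property (no Demucs or audio involved):

```python
import random

# Seeding the global RNG before each "run" makes the draws identical,
# which is the mechanism that makes repeated runs reproducible
random.seed(0)
first = [random.random() for _ in range(5)]

random.seed(0)
second = [random.random() for _ in range(5)]

assert first == second
```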

furqan4545 commented 1 year ago

Bro, this works like a G... Thanks for your amazing work and help. Really means a lot. I am building my SaaS around Whisper and some TTS services, and I assure you, once we grow and get enough customers to sustain ourselves, I will surely come back and pay back a really nice reward for your amazing work. You are a G.