Closed · furqan4545 closed this issue 1 year ago
Try this test to see if you have the required version of Whisper:
import importlib.metadata
import whisper
import warnings

_required_whisper_ver = list(
    filter(lambda x: x.startswith('openai-whisper'), importlib.metadata.distribution('stable-ts').requires)
)[0].split('==')[-1]
if (
    whisper.__version__ != _required_whisper_ver or  # check version
    importlib.metadata.distribution('openai-whisper').read_text('direct_url.json')  # check if installed from repo
):
    warnings.warn('The installed version of Whisper might be incompatible.\n'
                  'To prevent errors and performance issues, reinstall the correct version with: '
                  f'"pip install --upgrade --no-deps --force-reinstall openai-whisper=={_required_whisper_ver}".')
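For reference, here is a minimal, self-contained sketch of the parsing step in that check, run against an illustrative sample of what `importlib.metadata.distribution('stable-ts').requires` might return (the version pin below is made up for the example, not the real requirement):

```python
# Illustrative sample of a package's requires list (values are assumptions)
requires = ['numpy', 'torch', 'openai-whisper==20230314', 'tqdm']

# Same parsing as the snippet above: find the 'openai-whisper' entry
# and take the version string after '=='
required_ver = list(
    filter(lambda x: x.startswith('openai-whisper'), requires)
)[0].split('==')[-1]
print(required_ver)  # prints 20230314
```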
It would help if you could share an audio file you're seeing this issue with, along with its results as JSON files (preferably more than one JSON with different results).
My Whisper version is 20230314.
This is the video I'm transcribing: https://www.youtube.com/watch?v=dFxsi5GUQ5c&t=8s
Here is the link to my Colab notebook:
https://colab.research.google.com/drive/1AW9oS1NxPe_wpMaLskIOCkL6nSfitjqj?usp=sharing
You can use transcribe_minimal() to transcribe. It should give results similar to Whisper because it uses Whisper's default transcribe().
Note that demucs=True increases the chances of getting different results each time, because Demucs generates slightly different audio output on each run unless you specify the same seed every time you run it.
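For context on why a seed helps: the run-to-run variation comes from random operations, and reseeding Python's RNG with the same value before each run replays the same random sequence. A generic, library-free sketch of that idea (not the Demucs code itself):

```python
import random

# Without reseeding, two runs draw different random values.
random.seed(0)
first_run = [random.random() for _ in range(3)]

# Reseeding with the same value replays the exact same sequence,
# which is what makes a seeded run reproducible.
random.seed(0)
second_run = [random.random() for _ in range(3)]

assert first_run == second_run  # identical outputs across runs
```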
Hi, I tried transcribe_minimal() as well with different parameters, but I don't think it is original Whisper, because it works exactly the same as transcribe from stable whisper. I think you need to double-check when you get time, brother. Secondly, how can I pass a fixed seed into Demucs as you mentioned above? I explored the code but there is no such parameter. Can you tell me where I can pass a seed to keep the output the same?
The differences you see are likely due to the options you used. If you disable the pre- and post-processing for transcribe_minimal(), the results should match original Whisper's results because it calls whisper.transcribe() directly.
import whisper, stable_whisper
model = whisper.load_model('base')
res_original = model.transcribe('audio.mp3', word_timestamps=True, verbose=False)
res_original = stable_whisper.WhisperResult(res_original, force_order=True)
stable_whisper.modify_model(model)
# [demucs] and [only_voice_freq] are False by default, so preprocessing is disabled by default
# [regroup=False] and [suppress_silence=False] disable postprocessing
res_stable = model.transcribe_minimal('audio.mp3', regroup=False, suppress_silence=False)
assert res_original.to_srt_vtt() == res_stable.to_srt_vtt()
Specify the seed before each run with:
import random
random.seed(0)
# test if seed works
import torch
from stable_whisper.audio import demucs_audio
random.seed(0)
vocal0 = demucs_audio('audio.mp3')
random.seed(0)
vocal1 = demucs_audio('audio.mp3')
assert torch.isclose(vocal0, vocal1).all()
Bro, this works like a G... Thanks for your amazing work and help. It really means a lot. I am building my SaaS around Whisper and some TTS services, and I assure you, once we grow and get enough customers to sustain ourselves, I will surely come back and pay you back with a really nice reward for your amazing work. You are a G.
result2 = model.transcribe('tate_pier.mp3', mel_first=True, demucs=True)
result2 = model.transcribe('tate_pier.mp3', mel_first=True)
I used different settings as shown above, both with VAD and without VAD. The accuracy is not as great as original Whisper's: it is missing a lot of words sometimes, even though I am using the large-v2 model. Could you please tell me if there is any specific parameter I can use so that it doesn't miss words? Also, every time I run it, it gives me different results, so it's not very consistent, and I don't know how to make it consistent. Sometimes the results are very good and sometimes really bad. Your help will be highly appreciated.