Patrick10731 opened 4 months ago
The models with preconfigured alignment heads, or ones compatible with the original heads, will work.
For the ones compatible with the original heads, you can configure them manually by assigning the head indices to model._pipe.model.generation_config.alignment_heads.
Technically, even models without alignment heads, such as distil-large-v2, will work by disabling word timestamps with model.transcribe('audio.mp3', word_timestamps=False). However, many features, such as regrouping and word-level timestamp adjustment, will be unavailable.
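For example, a minimal sketch of assigning the heads manually (the [layer, head] pairs below are purely illustrative; use the indices that actually match your model):
import stable_whisper
model = stable_whisper.load_hf_whisper('distil-whisper/distil-large-v2', device='cpu')
# Illustrative [layer, head] pairs only; real alignment heads are model-specific.
model._pipe.model.generation_config.alignment_heads = [[7, 0], [10, 1], [11, 2]]
result = model.transcribe('audio.mp3')  # word timestamps now available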
Hi @Patrick10731, did you get any of the kotoba-whisper models to work with stable-ts? I am trying their kotoba-tech/kotoba-whisper-v2.1 model, but I keep getting an out-of-memory error.
@jianfch, I'm not sure if you have already come across the kotoba-tech models on Hugging Face. Their latest model uses stable-ts for accurate timestamps and regrouping. I thought you might be interested.
@jianfch Thanks, it worked
@dgoryeo I confirmed that this code works; give it a try:
import stable_whisper
model = stable_whisper.load_hf_whisper('kotoba-tech/kotoba-whisper-v1.1', device='cpu')
result = model.transcribe('audio.mp3', word_timestamps=False)
result.to_srt_vtt('audio.srt', word_level=False)
I also found that many models still won't work, but they will work if you convert them into a faster-whisper model.
For example, this model won't work:
import stable_whisper
model = stable_whisper.load_hf_whisper('Scrya/whisper-large-v2-cantonese', device='cpu')
result = model.transcribe('audio.mp3', word_timestamps=False)
result.to_srt_vtt('audio.srt', word_level=False)
But the following code will work:
import stable_whisper
model = stable_whisper.load_faster_whisper('XA9/faster-whisper-large-v2-cantonese-2', device='cpu', compute_type='default')
result = model.transcribe_stable('audio.mp3')
result.to_srt_vtt('audio.srt', word_level=False)
The converted model is from https://huggingface.co/XA9/faster-whisper-large-v2-cantonese-2, and it was converted with the following command:
ct2-transformers-converter --model Scrya/whisper-large-v2-cantonese --output_dir faster-whisper-large-v2-cantonese-2 --copy_files preprocessor_config.json --quantization float16
So I recommend trying to convert a model if it won't work.
Thank you @Patrick10731. By any chance, have you tried Kotoba's v2.1 (which is a distilled Whisper)?
I will try to follow your recommendation. At the moment I am running out of memory with v2.1, but I haven't tried CPU-only; I've only tried device='cuda' so far.
@dgoryeo I tried with this code and it worked. How about trying device='cpu'? The out-of-memory error is probably because your video card doesn't have enough memory.
import stable_whisper
model = stable_whisper.load_hf_whisper('kotoba-tech/kotoba-whisper-v2.1', device='cpu')
result = model.transcribe('audio.mp3', word_timestamps=False)
result.to_srt_vtt('audio.srt', word_level=False)
Thanks @Patrick10731, I will test it on CPU. I have 12GB of GPU VRAM, so I didn't expect to run out of memory. I'll test and report back.
@dgoryeo 12GB might be too low for the default batch_size=24. Try a smaller batch_size.
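For example, something like this (a sketch assuming this stable-ts version forwards batch_size from transcribe() to the underlying Hugging Face pipeline; 8 is an illustrative value to tune for your VRAM):
import stable_whisper
model = stable_whisper.load_hf_whisper('kotoba-tech/kotoba-whisper-v2.1', device='cuda')
# Smaller batches lower peak VRAM usage at some cost in throughput.
result = model.transcribe('audio.mp3', batch_size=8)
result.to_srt_vtt('audio.srt', word_level=False)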
@jianfch, that must be it. I'll change the batch_size accordingly.
When I use the model directly with transformers, I use batch_size 16 with no problem:
import torch
from transformers import pipeline

# The setup values below are illustrative (not part of the original snippet):
model_id = 'kotoba-tech/kotoba-whisper-v2.1'
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_kwargs = {}

pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16,
    trust_remote_code=True,
    stable_ts=True,
    punctuator=True,
)
Thanks
@dgoryeo You can pass this pipe directly to the pipeline parameter of stable_whisper.load_hf_whisper().
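For example (a sketch based on the note above; pipe is the pipeline object built earlier, and the exact signature may vary by stable-ts version):
import stable_whisper
# Reuse the existing transformers pipeline instead of loading the model again.
model = stable_whisper.load_hf_whisper('kotoba-tech/kotoba-whisper-v2.1', pipeline=pipe)
result = model.transcribe('audio.mp3')
result.to_srt_vtt('audio.srt', word_level=False)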
Reporting back that it worked.
I tested both options:
(a) calling model = stable_whisper.load_hf_whisper('kotoba-tech/kotoba-whisper-v2.1', device='cuda') directly, and
(b) passing the pipe to the pipeline parameter of stable_whisper.load_hf_whisper(), with device='cuda'.
Both worked, though I was happier with the results of (a).
I tried distil-whisper-v3 in stable-ts and it works. However, it doesn't work when I try distil-large-v2. Other models can't be used either (e.g. kotoba-whisper, kotoba-tech/kotoba-whisper-v1.0). What kinds of models can be used in stable-ts besides OpenAI's models?
import stable_whisper
model = stable_whisper.load_hf_whisper('distil-whisper/distil-large-v3', device='cpu')
result = model.transcribe('audio.mp3')
result.to_srt_vtt('audio.srt', word_level=False)