jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Huggingface's Fine Tuned model that can be used? #378

Open Patrick10731 opened 6 days ago

Patrick10731 commented 6 days ago

I tried to use distil-whisper-v3 in stable-ts and it works. However, it's unable to be used when I try "distil-large-v2". Other models can't be used either (e.g. kotoba-whisper, "kotoba-tech/kotoba-whisper-v1.0"). What kinds of models can be used in stable-ts besides OpenAI's models?

import stable_whisper

model = stable_whisper.load_hf_whisper('distil-whisper/distil-large-v3', device='cpu')
result = model.transcribe('audio.mp3')

result.to_srt_vtt('audio.srt', word_level=False)

jianfch commented 6 days ago

Models with preconfigured alignment heads, or ones whose heads are compatible with the original Whisper heads, will work. For the latter, you can configure them manually by assigning the head indices to model._pipe.model.generation_config.alignment_heads.
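
For example, a minimal sketch of that manual configuration (the model ID and head indices below are placeholders, not values verified for any particular model):

import stable_whisper

# Load a fine-tuned Whisper model from Hugging Face.
# 'your-org/your-finetuned-whisper' is a placeholder model ID.
model = stable_whisper.load_hf_whisper('your-org/your-finetuned-whisper', device='cpu')

# If the model's heads are compatible with the original Whisper heads,
# assign the (layer, head) index pairs manually. The pairs below are
# placeholders; use indices valid for the architecture the model is based on.
model._pipe.model.generation_config.alignment_heads = [[7, 0], [10, 17], [12, 18]]

result = model.transcribe('audio.mp3')
result.to_srt_vtt('audio.srt', word_level=False)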

Technically even models without alignment heads, such as distil-large-v2, will work as well by disabling word timestamps with model.transcribe('audio.mp3', word_timestamps=False). However, many features, such as regrouping and word-level timestamp adjustment, will be unavailable.
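
For instance, a sketch of transcribing distil-large-v2 with word timestamps disabled:

import stable_whisper

# distil-large-v2 has no preconfigured alignment heads,
# so word timestamps must be disabled.
model = stable_whisper.load_hf_whisper('distil-whisper/distil-large-v2', device='cpu')
result = model.transcribe('audio.mp3', word_timestamps=False)

# Segment-level timestamps are still produced, but word-level features
# (regrouping, word-level timestamp adjustment) are unavailable.
result.to_srt_vtt('audio.srt', word_level=False)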