jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Huggingface's Fine Tuned model that can be used? #378

Open Patrick10731 opened 4 months ago

Patrick10731 commented 4 months ago

I tried to use distil-whisper-v3 in stable-ts and it works. However, it doesn't work when I try "distil-large-v2". Other models can't be used either (e.g. kotoba-whisper, "kotoba-tech/kotoba-whisper-v1.0"). What kinds of models can be used in stable-ts besides OpenAI's models?

import stable_whisper

model = stable_whisper.load_hf_whisper('distil-whisper/distil-large-v3', device='cpu')
result = model.transcribe('audio.mp3')

result.to_srt_vtt('audio.srt', word_level=False)

jianfch commented 4 months ago

Models with preconfigured alignment heads, or ones compatible with the original heads, will work. For the latter, you can configure them manually by assigning the head indices to model._pipe.model.generation_config.alignment_heads.

Technically, even models without alignment heads, such as distil-large-v2, will work if you disable word timestamps with model.transcribe('audio.mp3', word_timestamps=False). However, many features, such as regrouping and word-level timestamp adjustment, will be unavailable.
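
For reference, a rough sketch of both options (the model id and head indices below are placeholders for illustration, not values for any particular model):

import stable_whisper

# 'some-org/some-finetuned-whisper' is a placeholder for whatever model you are loading
model = stable_whisper.load_hf_whisper('some-org/some-finetuned-whisper', device='cpu')

# option 1: for a model whose heads are compatible with the original heads,
# assign the (layer, head) index pairs manually (placeholder values shown)
model._pipe.model.generation_config.alignment_heads = [[3, 1], [4, 2], [5, 0]]
result = model.transcribe('audio.mp3')

# option 2: for a model without usable alignment heads (e.g. distil-large-v2),
# disable word timestamps instead
result = model.transcribe('audio.mp3', word_timestamps=False)

result.to_srt_vtt('audio.srt', word_level=False)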

dgoryeo commented 1 month ago

Hi @Patrick10731 , did you get any of the kotoba-whisper models to work with stable-ts? I am trying their kotoba-tech/kotoba-whisper-v2.1 model, but I keep getting an out-of-memory error.

@jianfch , I'm not sure if you have already come across the kotoba-tech models on Huggingface. Their latest model uses stable-ts for accurate timestamps and regrouping. I thought you might be interested.

Patrick10731 commented 1 month ago

@jianfch Thanks, it worked.

@dgoryeo I confirmed that the following code works; give it a try.


import stable_whisper

model = stable_whisper.load_hf_whisper('kotoba-tech/kotoba-whisper-v1.1', device='cpu')
result = model.transcribe('audio.mp3', word_timestamps=False)

result.to_srt_vtt('audio.srt', word_level=False)

I also found that many models that still won't work this way will work if you convert them into a faster-whisper model.

For example, this model won't work:

import stable_whisper

model = stable_whisper.load_hf_whisper('Scrya/whisper-large-v2-cantonese', device='cpu')
result = model.transcribe('audio.mp3', word_timestamps=False)

result.to_srt_vtt('audio.srt', word_level=False)

But the following code will work:

import stable_whisper

model = stable_whisper.load_faster_whisper('XA9/faster-whisper-large-v2-cantonese-2', device='cpu', compute_type='default')
result = model.transcribe_stable('audio.mp3')
result.to_srt_vtt('audio.srt', word_level=False)

The converted model is from https://huggingface.co/XA9/faster-whisper-large-v2-cantonese-2, and it was produced with the following command:

 ct2-transformers-converter --model Scrya/whisper-large-v2-cantonese --output_dir faster-whisper-large-v2-cantonese-2 --copy_files  preprocessor_config.json --quantization float16

So I recommend trying to convert a model if it won't work directly.

dgoryeo commented 1 month ago

Thank you @Patrick10731 , by any chance have you tried Kotoba's v2.1 (which is a distilled Whisper)?

I will try to follow your recommendation. At the moment I am running out of memory with v2.1, but I haven't tried CPU only; I've only tried device='cuda' so far.

Patrick10731 commented 1 month ago

@dgoryeo I tried with this code and it worked. How about trying device='cpu'? The out-of-memory error is probably because your video card doesn't have enough memory.


import stable_whisper

model = stable_whisper.load_hf_whisper('kotoba-tech/kotoba-whisper-v2.1', device='cpu')
result = model.transcribe('audio.mp3', word_timestamps=False)

result.to_srt_vtt('audio.srt', word_level=False)

dgoryeo commented 1 month ago

Thanks @Patrick10731 , I will test it on CPU. I have 12 GB of GPU VRAM, so I didn't expect to run out of memory. I'll test and report back.

jianfch commented 1 month ago

@dgoryeo 12GB might be too low for the default batch_size=24. Try a smaller batch_size.
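
For example, something like this (assuming batch_size can be passed to transcribe() here; check the signature in your stable-ts version):

import stable_whisper

model = stable_whisper.load_hf_whisper('kotoba-tech/kotoba-whisper-v2.1', device='cuda')
# a smaller batch_size lowers peak VRAM usage at the cost of some speed
result = model.transcribe('audio.mp3', word_timestamps=False, batch_size=8)

result.to_srt_vtt('audio.srt', word_level=False)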

dgoryeo commented 1 month ago

@jianfch , that must be it. I'll change the batch_size accordingly.

When I use the model directly with transformers, I use batch_size 16 with no problem:

    # transformers pipeline setup; model_id, torch_dtype, device, and
    # model_kwargs are defined elsewhere in my script
    from transformers import pipeline

    pipe = pipeline(
        model=model_id,
        torch_dtype=torch_dtype,
        device=device,
        model_kwargs=model_kwargs,
        chunk_length_s=15,
        batch_size=16,
        trust_remote_code=True,
        stable_ts=True,
        punctuator=True
    )

Thanks

jianfch commented 1 month ago

@dgoryeo You can pass this pipe directly to the pipeline parameter of stable_whisper.load_hf_whisper().
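
For example, something along these lines (a sketch; the exact signature of load_hf_whisper() may differ in your version):

import stable_whisper

# 'pipe' is the transformers pipeline built above; stable-ts will use it
# instead of constructing its own
model = stable_whisper.load_hf_whisper('kotoba-tech/kotoba-whisper-v2.1', device='cuda', pipeline=pipe)
result = model.transcribe('audio.mp3', word_timestamps=False)
result.to_srt_vtt('audio.srt', word_level=False)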

dgoryeo commented 1 month ago

Reporting back that it worked.

I tested both options: (a) calling model = stable_whisper.load_hf_whisper('kotoba-tech/kotoba-whisper-v2.1', device='cuda') directly, and (b) passing the pipe to the pipeline parameter of stable_whisper.load_hf_whisper(), with device='cuda'.

Both worked, though I was happier with the results of (a).