shashikg / WhisperS2T

An Optimized Speech-to-Text Pipeline for the Whisper Model Supporting Multiple Inference Engine
MIT License

huggingface repository for ctranslate2 models or provide as alternative source #20

Closed BBC-Esq closed 6 months ago

BBC-Esq commented 7 months ago

Again, I am extremely impressed by this program. I've been pining for a long time for something that can do batch processing based on ctranslate2, especially since huggingface claimed to have the fastest implementation.

What are your thoughts about a pull request to change the hf_utils.py script to list my repository, which hosts all the various quantizations of the whisper models?

https://huggingface.co/ctranslate2-4you

I realize that ctranslate2 can quantize at runtime, but it does take some additional time and I figured users might want the option. Alternatively, would you be willing to accept a pull request on your repository linking my huggingface repository as an alternative source to the default "systran" models?

The difference is that I've quantized every whisper model size to every available ctranslate2 quantization (except int16, of course)...EXCEPT large-v3, since I've noticed regressions with it. Either way, great job, and I'm excited to include it in my programs!

If you're a visual person, here's an example of just my conversions of the large-v2 model:

[image: screenshot of the ctranslate2-4you large-v2 conversions]

shashikg commented 7 months ago

Hi, can you check and report here -- what is the difference in load time between loading from a float16-exported model vs. an int8-exported model when using int8 precision?
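
(For reference, a minimal timing sketch for such a comparison, using the load_model call described later in this thread; it assumes both exported checkpoints are already in local directories, and the paths below are placeholders.)

import time

import whisper_s2t

# Placeholder paths: one float16-exported and one int8-exported large-v2 checkpoint.
paths = {
    'float16_export': '/path/to/whisper-large-v2-ct2-float16',
    'int8_export': '/path/to/whisper-large-v2-ct2-int8',
}

for name, path in paths.items():
    start = time.perf_counter()
    # Both are loaded with int8 precision, so the float16 export gets quantized
    # at load time while the int8 export is used as-is.
    model = whisper_s2t.load_model(model_identifier=path,
                                   backend='CTranslate2',
                                   compute_type='int8')
    print(f"{name}: loaded in {time.perf_counter() - start:.2f} s")
    del model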

BBC-Esq commented 7 months ago

Hi, can you check and report here -- what is the difference in load time between loading from a float16-exported model vs. an int8-exported model when using int8 precision?

It'll depend on the GPU; I have a 4090. Unfortunately, I don't have my old script anymore since I already did this testing and made the changes to my program. However, if you'll tell me how to specify a directory instead of automatically downloading the ctranslate2 checkpoints in this script, I can re-test.

Also, I would add that my repository has the bfloat16 and float32 ctranslate2 versions. I've found, for example, that the small model running in float32 is about the same (in terms of quality) as medium in float16, while using less VRAM AND taking less time. So there's an advantage in having float32 as an option.

Moreover, if a user wants to use bfloat16 (e.g. with newer GPUs that support that compute level), converting at runtime from float16 to bfloat16 results in "slightly" lower quality. Ideally, you'd want to use the float32 model converted to bfloat16 at runtime...or simply use the bfloat16 model from the beginning.

Lastly, if a user wants to use int8_float32 or int8_bfloat16 (both supported by ctranslate2), you'd want to convert from the original float32 version to prevent data loss and a slight decrease in quality (of course, the time to convert is a factor, as discussed previously). For example, you wouldn't want to convert a float16 model to int8_float32, whereas, as I understand it, converting from float16 to int8_float16 wouldn't suffer from that decrease. Hope that makes sense. Before the creator of faster-whisper took a job at Apple, I verified this with him. It's a small difference in quality, but a difference nonetheless (per @guilliakan).
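
(To illustrate the conversion-source point, a hedged sketch of exporting straight from the original Hugging Face weights with CTranslate2's Transformers converter, so a bfloat16 export never passes through an intermediate float16 checkpoint; the output directory name is just an example.)

from ctranslate2.converters import TransformersConverter

# Convert directly from the original openai/whisper-large-v2 weights rather than
# re-quantizing an already float16-exported model.
converter = TransformersConverter(
    'openai/whisper-large-v2',
    copy_files=['tokenizer.json', 'preprocessor_config.json'],
)
converter.convert('whisper-large-v2-ct2-bfloat16', quantization='bfloat16')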

One great thing about ctranslate2 is that it'll automatically do the conversion at runtime and pick the next best option, e.g. even if you specify bfloat16 with a GPU that doesn't support it. It's just not ideal.

If you ONLY plan on supporting int8 or float16, which covers 90% of people's use cases, that's a different story, but I thought it'd be nice to have the option there for people.
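
(A hedged sketch of checking what a given GPU actually supports before picking a compute_type, using the ctranslate2 package that the CTranslate2 backend already relies on.)

import ctranslate2
import whisper_s2t

# Compute types the local GPU supports natively; CTranslate2 automatically falls
# back to the closest supported type if you request one that isn't available.
supported = ctranslate2.get_supported_compute_types("cuda")
print(supported)

# Example: prefer bfloat16 on GPUs that support it, otherwise fall back to float16.
compute_type = 'bfloat16' if 'bfloat16' in supported else 'float16'
model = whisper_s2t.load_model(model_identifier="large-v2",
                               backend='CTranslate2',
                               compute_type=compute_type)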

Anyhow, here's the script:

import whisper_s2t
from whisper_s2t.backends.ctranslate2.model import BEST_ASR_CONFIG, FAST_ASR_OPTIONS

model_kwargs = {
    'compute_type': 'float16',
    #'asr_options': BEST_ASR_CONFIG
    'asr_options': FAST_ASR_OPTIONS
}

model = whisper_s2t.load_model(model_identifier="large-v2", backend='CTranslate2', **model_kwargs)

files = ['test_audio_flac.flac']
lang_codes = ['en']
tasks = ['transcribe']
initial_prompts = [None]

out = model.transcribe_with_vad(files,
                                lang_codes=lang_codes,
                                tasks=tasks,
                                initial_prompts=initial_prompts,
                                batch_size=20)

# Concatenate the text from all utterances
transcription = " ".join([_['text'] for _ in out[0]]).strip()

with open('transcription.txt', 'w') as f:
    f.write(transcription)
BBC-Esq commented 7 months ago

Can you please tell me how to load a model from a directory?

shashikg commented 7 months ago

Can you please tell me how to load a model from a directory?

model = whisper_s2t.load_model(model_identifier="large-v2", backend='CTranslate2')

Here model_identifier supports three different kinds of values for the CTranslate2 and HuggingFace backends: model_name, local_path, or hf_repo_id.

So, in your case, this should simply work:

model = whisper_s2t.load_model(model_identifier="ctranslate2-4you/whisper-large-v2-ct2-int8_float16", backend='CTranslate2', compute_type='int8')

Or, to load from a local directory:

model = whisper_s2t.load_model(model_identifier="your_local_path_to_model_files", backend='CTranslate2')

To get the path where the downloaded model was saved: print(model.model_path)
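
(Putting those pieces together, a small sketch: download once from the repo id mentioned above, note the cache path, and point later loads straight at that directory.)

import whisper_s2t

# First load: downloads the exported model from the Hugging Face repo.
model = whisper_s2t.load_model(
    model_identifier="ctranslate2-4you/whisper-large-v2-ct2-int8_float16",
    backend='CTranslate2',
    compute_type='int8',
)
local_dir = model.model_path  # local directory where the downloaded files were saved
print(local_dir)

# Later loads can use that local directory directly.
model = whisper_s2t.load_model(model_identifier=local_dir,
                               backend='CTranslate2',
                               compute_type='int8')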

PS: I'm working on detailed docs along with a couple of other minor features... most probably will update those by the end of this month.

BBC-Esq commented 7 months ago

Thanks, I'll try to get the conversion times for you as you asked. What did you think about my other reasons for including a wide variety of quantizations of the whisper models? Or do you only want to change things if there's a significant difference in load time? I figured your program could continue to rely on the systran models, but offer users another option at least...

shashikg commented 7 months ago

Or do you only want to change things if there's a significant difference in load time?

Yes. If there isn't any significant difference in load_time, inference_time, or accuracy, I don't think it makes sense to target a different repository for different configurations. If anyone still wants to use another repo, they can do so by passing the specific repo_id instead of just the model name.

Actually, in the future I am planning to remove the systran links as well. I'm working on a unified tarred representation for different backends, which can be easily saved as an exported model like whisper_large_ct2_int8.wst. It is part of this: https://github.com/shashikg/WhisperS2T/issues/8#issuecomment-1919818938

BBC-Esq commented 6 months ago

Closing for lack of interest.