Closed BBC-Esq closed 6 months ago
Hi can you check and report here -- what is the difference in load time for loading from a float16 exported model vs an int8 exported model when using int8 precision?
> Hi can you check and report here -- what is the difference in load time for loading from float16 exported model vs int8 exported model when using int8 precision.
It'll depend on the GPU; I have a 4090. Unfortunately, I don't have my old script anymore since I already did this testing and made the changes to my program... However, if you'll tell me how to specify a directory instead of automatically downloading the ctranslate2 checkpoints in this script, I can re-test.
Also, I would add that my repository has the `bfloat16` and `float32` ctranslate2 versions. I've found, for example, that the `small` model running in `float32` is about the same (in terms of quality) as `medium` in `float16`... while using less VRAM AND taking less time. So there's an advantage there in having `float32` as an option.
Moreover, if a user wants to use `bfloat16` (e.g. with newer GPUs that support that compute level), converting at runtime from `float16` to `bfloat16` results in "slightly" lower quality. Ideally, you'd want to use the `float32` model converted to `bfloat16` at runtime... or simply use the `bfloat16` model from the beginning.
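The precision loss from routing through `float16` can be sketched in pure Python. This is an illustration only, not CTranslate2's actual conversion code: it models `bfloat16` as simple truncation to the top 16 bits of the `float32` bit pattern (real converters may round instead), and the value `x` is deliberately crafted so the intermediate `float16` rounding shifts the final `bfloat16` result.

```python
import struct

def to_f16(x):
    """Round-trip a value through IEEE float16 (10 mantissa bits)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_bf16(x):
    """Model bfloat16 (7 mantissa bits) by truncating a float32 bit
    pattern to its top 16 bits. Illustration only; real converters
    may round-to-nearest instead of truncating."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

# A weight value crafted so float16 rounding changes the bfloat16 result.
x = 1 + 15 / 2048  # exactly representable in float32

direct  = to_bf16(x)           # float32 -> bfloat16
via_f16 = to_bf16(to_f16(x))   # float32 -> float16 -> bfloat16

print(direct, via_f16)  # 1.0 vs 1.0078125 under this truncation model
```

Most weights won't land on such a boundary, which is why the quality drop is only "slight" -- but summed over millions of parameters it is measurable.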
Lastly, if a user wants to use `int8_float32` or `int8_bfloat16`, which are supported by ctranslate2, you'd want to convert from the original `float32` version to prevent data loss and a slight decrease in quality (of course, the time to convert is a factor as well, as discussed previously). For example, you wouldn't want to convert a `float16` model to `int8_float32`... whereas, it's my understanding that converting from `float16` to `int8_float16` wouldn't suffer from the decrease. Hope that makes sense. Before the creator of faster-whisper took a job at Apple, I verified this with him. It's a small difference in quality, but a difference nonetheless (per @guilliakan).
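For context on what the `int8_*` modes involve, here is a generic symmetric int8 quantization sketch. This is a common textbook scheme, not CTranslate2's exact method -- treat the scale choice and rounding as assumptions for illustration:

```python
def quantize_int8(values):
    """Symmetric int8 quantization: map [-max|v|, +max|v|] onto [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale

def dequantize(quantized, scale):
    """Recover approximate float values from int8 codes and the scale."""
    return [q * scale for q in quantized]

weights = [0.3, -1.0, 0.25]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each restored weight is within half a quantization step of the original.
assert all(abs(w - r) <= scale / 2 for w, r in zip(weights, restored))
```

The `float32`/`float16`/`bfloat16` suffix in `int8_float32` etc. refers to the floating-point precision kept alongside the int8 weights, which is why the source precision you convert from still matters.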
One great thing about ctranslate2 is that it'll automatically do the conversion at runtime and pick the next best one -- e.g. even if you specify `bfloat16` with a GPU that doesn't support it. It's just not ideal.
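The runtime fallback described above can be pictured as a preference-order lookup. The fallback order below is my assumption for illustration only -- in the real library this resolution happens inside `ctranslate2` when the model loads, and its exact ordering is internal:

```python
# Assumed fallback preference per requested compute type (illustrative only).
FALLBACK_ORDER = {
    "bfloat16":      ["bfloat16", "float16", "float32"],
    "float16":       ["float16", "float32"],
    "int8_bfloat16": ["int8_bfloat16", "int8_float16", "int8_float32"],
    "int8_float16":  ["int8_float16", "int8_float32"],
}

def resolve_compute_type(requested, supported):
    """Return the requested type if the device supports it, else the next best."""
    for candidate in FALLBACK_ORDER.get(requested, [requested]):
        if candidate in supported:
            return candidate
    return "float32"  # full precision works everywhere

# e.g. requesting bfloat16 on a GPU without bfloat16 support:
print(resolve_compute_type("bfloat16", {"float16", "float32", "int8"}))  # float16
```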
If you ONLY plan on supporting `int8` or `float16`, which covers 90% of people's use cases, that's a different story, but I thought it'd be nice to have the option there for people.
Anyhow, here's the script:
```python
import whisper_s2t
from whisper_s2t.backends.ctranslate2.model import BEST_ASR_CONFIG, FAST_ASR_OPTIONS

model_kwargs = {
    'compute_type': 'float16',
    # 'asr_options': BEST_ASR_CONFIG,
    'asr_options': FAST_ASR_OPTIONS,
}

model = whisper_s2t.load_model(model_identifier="large-v2", backend='CTranslate2', **model_kwargs)

files = ['test_audio_flac.flac']
lang_codes = ['en']
tasks = ['transcribe']
initial_prompts = [None]

out = model.transcribe_with_vad(files,
                                lang_codes=lang_codes,
                                tasks=tasks,
                                initial_prompts=initial_prompts,
                                batch_size=20)

# Concatenate the text from all utterances
transcription = " ".join(utterance['text'] for utterance in out[0]).strip()

with open('transcription.txt', 'w') as f:
    f.write(transcription)
```
Can you please tell me how to load a model from a directory?
> Can you please tell me how to load a model from a directory?
```python
model = whisper_s2t.load_model(model_identifier="large-v2", backend='CTranslate2')
```
Here `model_identifier` supports three different kinds of values for the CTranslate2 and HuggingFace backends: a `model_name`, a `local_path`, or an `hf_repo_id`.
So, in your case simply this should work:
```python
model = whisper_s2t.load_model(model_identifier="ctranslate2-4you/whisper-large-v2-ct2-int8_float16", backend='CTranslate2', compute_type='int8')
```
Or to load from local directory:
```python
model = whisper_s2t.load_model(model_identifier="your_local_path_to_model_files", backend='CTranslate2')
```
To get the path where the downloaded model was saved: `print(model.model_path)`
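The three-way dispatch can be pictured like this -- a hypothetical sketch of the idea, not WhisperS2T's actual code, and the set of known model names is an assumption:

```python
import os

# Assumed set of recognized short model names (illustrative only).
KNOWN_MODEL_NAMES = {"tiny", "base", "small", "medium", "large-v2", "large-v3"}

def classify_identifier(model_identifier):
    """Guess whether an identifier is a model name, local path, or HF repo id."""
    if model_identifier in KNOWN_MODEL_NAMES:
        return "model_name"
    if os.path.isdir(model_identifier):
        return "local_path"
    return "hf_repo_id"

print(classify_identifier("large-v2"))                                            # model_name
print(classify_identifier("ctranslate2-4you/whisper-large-v2-ct2-int8_float16"))  # hf_repo_id
```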
PS: I'm working on detailed docs along with a couple of other minor features... most probably will update those by the end of this month.
Thanks, I'll try to get the conversion times for you as you asked. What did you think about my other reasons for including a wide variety of quantizations for the whisper models? Or do you only want to change things if there's a significant load-time difference? I figured your program could continue to rely on the systran models, but offer users another option at least...
> Or do you only want to change things if there's a significant load time?
Yes -- if there isn't any significant difference in load time, inference time, or accuracy, I don't think it makes sense to target a different repository for different configurations. If anyone still wants to use another repo, they can by passing the specific repo_id instead of just the model name.
Actually, in the future I am planning to remove the systran links as well. I'm working on a unified tarred representation for the different backends, which can be easily saved as an exported model like `whisper_large_ct2_int8.wst`. It is part of this: https://github.com/shashikg/WhisperS2T/issues/8#issuecomment-1919818938
Closing for lack of interest.
Again, I am extremely impressed by this program. I've been pining for a long time for something that can do batch processing based on ctranslate2, especially since huggingface claimed to have the fastest implementation.
What are your thoughts about a pull request to change the `hf_utils.py` script to list my repository for all the various quantizations of the whisper models? https://huggingface.co/ctranslate2-4you
I realize that ctranslate2 can quantize at runtime, but it does take some additional time, and I figured users might want the option. Alternatively, would you be willing to accept a pull request on your repository linking my huggingface repository as an alternative source to the default "systran" models?
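One way such an alternative-source option could look is a simple lookup table. This is a purely hypothetical sketch: only the `int8_float16` repo id below appears in this thread, and the default repo naming pattern is an assumption:

```python
# Hypothetical mapping from (model, compute_type) to an alternative HF repo.
ALT_REPOS = {
    ("large-v2", "int8_float16"): "ctranslate2-4you/whisper-large-v2-ct2-int8_float16",
    # ... one entry per model size / quantization ...
}

def pick_repo(model_name, compute_type, use_alt=False):
    """Return an alternative repo if one is listed, else the default."""
    default = f"Systran/faster-whisper-{model_name}"  # assumed default naming
    if use_alt:
        return ALT_REPOS.get((model_name, compute_type), default)
    return default

print(pick_repo("large-v2", "int8_float16", use_alt=True))
# ctranslate2-4you/whisper-large-v2-ct2-int8_float16
```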
The difference is I've quantized every whisper model size to every available ctranslate2 quantization (except `int16`, of course)... EXCEPT large-v3, since I've noticed regressions with it. Either way, great job and I'm excited to include it in my programs! If you're a visual person, here's an example of just my conversions of the large-v2 model: