k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

What about int8 weights? #608

Open AntonThai2022 opened 3 weeks ago

AntonThai2022 commented 3 weeks ago

hello! I built int8 weights:

```shell
INFERENCE_PRECISION=float16
WEIGHT_ONLY_PRECISION=int8
MAX_BEAM_WIDTH=4
MAX_BATCH_SIZE=8
checkpoint_dir=whisper_large_v3_weights_${WEIGHT_ONLY_PRECISION}
output_dir=whisper_large_v3_${WEIGHT_ONLY_PRECISION}
```

Convert the large-v3 model weights into TensorRT-LLM format.

```shell
python3 convert_checkpoint.py \
  --use_weight_only \
  --weight_only_precision $WEIGHT_ONLY_PRECISION \
  --output_dir $checkpoint_dir
```
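Note that `convert_checkpoint.py` only produces an intermediate TensorRT-LLM checkpoint; the engines still have to be built from it before Triton can load anything. A sketch of that step, assuming the generic `trtllm-build` CLI with separate encoder/decoder checkpoints (the exact flag set for Whisper varies by TensorRT-LLM version, so check the version-matched example):

```shell
# Sketch only: build engines from the converted checkpoint.
# Flags shown are common trtllm-build options; the Whisper recipe for your
# TensorRT-LLM version may require additional ones.
trtllm-build --checkpoint_dir "$checkpoint_dir/encoder" \
             --output_dir "$output_dir/encoder" \
             --max_batch_size "$MAX_BATCH_SIZE"

trtllm-build --checkpoint_dir "$checkpoint_dir/decoder" \
             --output_dir "$output_dir/decoder" \
             --max_beam_width "$MAX_BEAM_WIDTH" \
             --max_batch_size "$MAX_BATCH_SIZE"

# It is $output_dir (the built engines), not $checkpoint_dir (the intermediate
# weights), that belongs under model_repo_whisper_trtllm/whisper/1/.
```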

So I got whisper_large_v3_weights_int8 and put it in sherpa/triton/whisper/model_repo_whisper_trtllm/whisper/1, but it does not work. I tried renaming it to whisper_large_v3, but that did not help. Is it possible to run an int8 Whisper with your repo and Docker image?
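A symptom like this often means the server was pointed at the converted checkpoint instead of the built engines. A minimal sketch for telling the two apart before deploying, assuming TensorRT-LLM's usual layout where checkpoint directories hold `.safetensors` weight shards and built engine directories hold `.engine` files:

```python
# Heuristic check: is this directory a built engine or an intermediate
# checkpoint? (Assumption: checkpoints contain *.safetensors shards, built
# engines contain *.engine files; both typically also carry a config.json.)
from pathlib import Path

def classify_trtllm_dir(path):
    p = Path(path)
    if any(p.glob("*.engine")):
        return "engine"      # built engine: this is what Triton should load
    if any(p.glob("*.safetensors")):
        return "checkpoint"  # intermediate weights: still needs trtllm-build
    return "unknown"
```

Only a directory that classifies as `"engine"` should be placed under `model_repo_whisper_trtllm/whisper/1`.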

csukuangfj commented 2 weeks ago

@yuekaizhang Could you have a look? Thanks!

yuekaizhang commented 2 weeks ago


@AntonThai2022 Would you mind pasting the error logs here?

AntonThai2022 commented 2 weeks ago

I apologize for creating this topic: I had simply mixed up the folder with the intermediate weights and the folder with the engine weights. Once I took the right folder and renamed it, everything worked. It's strange that the regular version takes 8 GB while the 8-bit one takes 7 GB, even though the speedup was almost 2x.
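The modest size drop is less surprising once you note that weight-only int8 quantizes only the weight bytes, while the rest of the engine file does not shrink. A back-of-envelope sketch, using an assumed round parameter count for Whisper large-v3:

```python
# Weight-only int8 halves the bytes spent on weights, but engine files also
# contain non-weight content (serialized graph, plugin data, metadata), so the
# total file shrinks by less than half.
# ~1.5e9 is an assumed round parameter count for Whisper large-v3.
params = 1.5e9
fp16_weight_gb = params * 2 / 1e9  # 2 bytes per parameter
int8_weight_gb = params * 1 / 1e9  # 1 byte per parameter (per-channel scales ignored)
print(fp16_weight_gb, int8_weight_gb)
```

The runtime speedup comes from the reduced memory traffic when reading weights, which is why throughput can nearly double even though the file size barely changes.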