k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

Transcription fails frequently after setting batch size to 4 #606

Closed by evanxqs 4 weeks ago

evanxqs commented 4 weeks ago

Hi,

I got a lot of transcription failures after setting the batch size to 4 and rebuilding the TensorRT-LLM engine with the parameters below:

export INFERENCE_PRECISION=float16
export MAX_BEAM_WIDTH=4
export MAX_BATCH_SIZE=4
export checkpoint_dir=tllm_checkpoint
export output_dir=whisper_large_v3

2024-06-03 02:05:05 ERROR (exception.py:183):handler(): An error occurred: [StatusCode.INTERNAL] Failed to process the request(s) for model instance 'whisper_0_0', message: RuntimeError: Could not set shape torch.Size([5, 128, 3000]) for tensor x. Please check the profile range for which your model was build.

At:
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/session.py(195): infer_shapes
/workspace/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(64): get_audio_features
/workspace/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(202): process_batch
/workspace/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/model.py(108): execute

csukuangfj commented 4 weeks ago

@yuekaizhang Could you have a look?

yuekaizhang commented 4 weeks ago

> I got a lot of transcription failures after setting the batch size to 4 ... RuntimeError: Could not set shape torch.Size([5, 128, 3000]) for tensor x. Please check the profile range for which your model was build.

@evanxqs Please increase max_batch_size to > 5. The failing tensor has shape [5, 128, 3000]; its first dimension is the batch size.
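To make the mismatch concrete, here is a minimal, hypothetical sketch of the constraint (the function name is illustrative and not from the repo; the shapes and the MAX_BATCH_SIZE value are taken from the build variables and the error above):

```python
import torch

def check_batch_fits_engine(mel: torch.Tensor, engine_max_batch_size: int) -> None:
    """Hypothetical guard mirroring the failure above: the first dimension of the
    mel-spectrogram batch must not exceed the MAX_BATCH_SIZE the engine was built
    with, otherwise TensorRT-LLM rejects the shape."""
    if mel.shape[0] > engine_max_batch_size:
        raise RuntimeError(
            f"batch size {mel.shape[0]} is outside the engine profile range "
            f"[1, {engine_max_batch_size}]; rebuild with a larger MAX_BATCH_SIZE"
        )

check_batch_fits_engine(torch.zeros(4, 128, 3000), engine_max_batch_size=4)  # fine
check_batch_fits_engine(torch.zeros(5, 128, 3000), engine_max_batch_size=4)  # raises, like the error above
```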

evanxqs commented 4 weeks ago

I also wonder whether there is any performance improvement when switching from batch size 8 to 5 or other values.

With batch size = 8, I ran a performance test and found almost no difference between Large-v3 and Large-v2 on 10 s audio inference. That is odd, since the README says Large-v3 takes only about 1/10 of the RTF. Does any other parameter need to be set?

[screenshot: benchmark results]
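For reference, a minimal sketch of how a real-time factor is usually computed (this is an assumption about the metric behind the README numbers, not taken from the repo):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio.
    Lower is faster; 0.1 means the model runs ten times faster than real time."""
    return processing_seconds / audio_seconds

# Example: 10 s of audio transcribed in 1 s gives an RTF of 0.1.
print(real_time_factor(1.0, 10.0))
```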

yuekaizhang commented 4 weeks ago

> I also wonder whether there is any performance improvement when switching from batch size 8 to 4 ... the README says Large-v3 takes only about 1/10 of the RTF.

For Triton service tuning, I recommend benchmarking with a whole dataset rather than a single wav file: https://github.com/k2-fsa/sherpa/tree/master/triton/whisper#benchmark-using-dataset.

Also, please check https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/log/stats_summary.txt; it can help you decide on configuration changes.
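As one possible way to collect similar statistics from your own deployment, here is a minimal sketch using the tritonclient Python package (the server address "localhost:8001" and the model name "whisper" are assumptions about your setup):

```python
import json
import tritonclient.grpc as grpcclient

# Connect to the Triton server's gRPC endpoint (address is an assumption).
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Fetch per-model inference statistics (queue time, compute time, batch counts),
# similar in spirit to the stats_summary.txt referenced above.
stats = client.get_inference_statistics(model_name="whisper", as_json=True)
print(json.dumps(stats, indent=2))
```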

For TensorRT-LLM engine performance tuning, you may want to start with offline inference. Please try https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/run.py first.

evanxqs commented 4 weeks ago

> For Triton service tuning, I recommend benchmarking with a whole dataset rather than a single wav file ... Please try https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/run.py first.

Thank you so much~!

danpovey commented 4 weeks ago

Can someone explain why the batch size has to be more than 5? From "Please check the profile range for which your model was build." it seems that some model optimization step may require the batch size to be in a certain range, but I don't understand why. The only matching thing I find online is https://github.com/NVIDIA/TensorRT-LLM/issues/1092

yuekaizhang commented 4 weeks ago

> Can someone explain why the batch size has to be more than 5? ... The only matching thing I find online is NVIDIA/TensorRT-LLM#1092

Sorry for not explaining it clearly. When exporting the engine for TensorRT or TensorRT-LLM, we need to set the min, optimal, and max shape range for each input tensor. For example, when setting the batch size for the Whisper encoder, you can refer to this link: Whisper Encoder Batch Size Setting.

The tensor sizes for model inference must fall within this range. In this issue, the max_batch_size is set to 4, but an attempt was made to use 5 for inference. Using the input tensor with the optimal shape will yield the best performance.
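As an illustration only (this uses the raw TensorRT Python API rather than the TensorRT-LLM build flow the repo actually uses; the tensor name "x" and the 128 x 3000 mel shape come from the error above, while the max batch of 8 is an assumption), the min/opt/max range for the encoder input could be declared like this:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Declare the allowed shape range for the encoder input "x" (mel features):
# the batch dimension may vary between min and max; kernels are tuned for opt.
profile = builder.create_optimization_profile()
profile.set_shape("x",
                  min=(1, 128, 3000),   # smallest batch the engine will accept
                  opt=(4, 128, 3000),   # shape the engine is optimized for
                  max=(8, 128, 3000))   # largest batch the engine will accept
config.add_optimization_profile(profile)
```

At runtime, any input whose batch dimension falls outside [min, max] produces exactly the "Could not set shape ... Please check the profile range" error above; the MAX_BATCH_SIZE variable in the build commands earlier in this thread presumably feeds the max end of this range.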

Although I am not a CUDA expert, the likely reason for setting this range is that the GEMM-accelerated CUDA kernels the engine can choose from vary with the tensor size, and the choice of kernel has some impact on performance. For instance, if max_batch_size is set too large, it might limit the available kernel choices or leave no suitable kernel at all. @danpovey