Support Whisper TensorRT-LLM via the Triton Python backend. See more performance data of TRT-LLM Whisper here.
Compared with ONNX FP16, it is about 7x faster.
Decoding on a single V100 GPU, with audio padded to 30 s and using the AISHELL-1 test set, it processes about 50 seconds of audio per second in client-server mode.
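For reference, below is a minimal sketch of how a client might drive the server in client-server mode, using Triton's gRPC Python client. The model name ("whisper") and the tensor names ("WAV", "WAV_LENS", "TRANSCRIPTS") are assumptions for illustration only; check the deployment's config.pbtxt for the actual names.

```python
# Minimal Triton gRPC client sketch for Whisper decoding.
# Assumed names: model "whisper", inputs "WAV"/"WAV_LENS", output "TRANSCRIPTS".
import numpy as np
import soundfile as sf
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Load a 16 kHz mono waveform; the server pads to 30 s internally.
wav, sr = sf.read("test.wav", dtype="float32")
samples = wav.reshape(1, -1)
lengths = np.array([[samples.shape[1]]], dtype=np.int32)

inputs = [
    grpcclient.InferInput("WAV", list(samples.shape), "FP32"),
    grpcclient.InferInput("WAV_LENS", list(lengths.shape), "INT32"),
]
inputs[0].set_data_from_numpy(samples)
inputs[1].set_data_from_numpy(lengths)

outputs = [grpcclient.InferRequestedOutput("TRANSCRIPTS")]
result = client.infer("whisper", inputs, outputs=outputs)
print(result.as_numpy("TRANSCRIPTS")[0].decode())
```

The ~50 s of audio per second figure reflects this request/response pattern with the server batching concurrent requests; a single sequential client would see lower throughput.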