k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

Add whisper TensorRT-LLM triton python backend support #551

Closed yuekaizhang closed 4 months ago

yuekaizhang commented 4 months ago

Support Whisper TensorRT-LLM via the Triton Python backend. See more performance data for TRT-LLM Whisper here.

Compared with ONNX FP16, this gives roughly a 7x speedup. Decoding on a single V100 GPU, with audios padded to 30 s and using the AISHELL-1 test set, it processes about 50 seconds of audio per second in client-server mode.

| Model | Backend | Concurrency | RTF |
| --- | --- | --- | --- |
| Large-v2 | ONNX FP16 (deprecated) | 4 | 0.14 |
| Large-v3 | TensorRT-LLM FP16 | 4 | 0.0209 |

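For readers unfamiliar with the RTF column: the real-time factor is decoding time divided by audio duration, so "about 50 seconds of audio per second" corresponds to an RTF near 1/50. A minimal sketch (the numbers below are illustrative, taken from the figures quoted above):

```python
# Sketch: how RTF (real-time factor) relates to throughput.
# Illustrative only; numbers are taken from the table above.

def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of audio decoded."""
    return processing_seconds / audio_seconds

# Processing ~50 s of audio per wall-clock second gives RTF ~ 1/50 = 0.02,
# consistent with the 0.0209 reported for TensorRT-LLM FP16.
print(round(rtf(1.0, 50.0), 3))  # 0.02
```

A lower RTF means faster-than-real-time decoding; RTF 0.0209 at concurrency 4 is roughly 7x lower than the 0.14 measured for the deprecated ONNX FP16 backend.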
yuekaizhang commented 4 months ago

@csukuangfj Would you mind reviewing it when you are free? Thanks.