k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

Add whisper TensorRT-LLM triton python backend support #551

Closed yuekaizhang closed 4 months ago

yuekaizhang commented 4 months ago

Support Whisper TensorRT-LLM via the Triton Python backend. See more performance data for TRT-LLM Whisper here.

Compared with ONNX FP16, this gives roughly a 7x speedup. Decoding on a single V100 GPU, with audios padded to 30 s and using the AISHELL-1 test set, it processes about 50 seconds of audio per second in client-server mode.

| Model | Backend | Concurrency | RTF |
| --- | --- | --- | --- |
| Large-v2 | ONNX FP16 (deprecated) | 4 | 0.14 |
| Large-v3 | TensorRT-LLM FP16 | 4 | 0.0209 |

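For readers unfamiliar with the RTF column: the real-time factor is decoding time divided by audio duration, so "about 50 seconds of audio per second" corresponds to an RTF near 1/50. A minimal sketch (the numbers below are illustrative, taken from the figures quoted above):

```python
# Sketch: how RTF (real-time factor) relates to throughput.
# Illustrative only; numbers are taken from the table above.

def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of audio decoded."""
    return processing_seconds / audio_seconds

# Processing ~50 s of audio per wall-clock second gives RTF ~ 1/50 = 0.02,
# consistent with the 0.0209 reported for TensorRT-LLM FP16.
print(round(rtf(1.0, 50.0), 3))  # 0.02
```

A lower RTF means faster-than-real-time decoding; RTF 0.0209 at concurrency 4 is roughly 7x lower than the 0.14 measured for the deprecated ONNX FP16 backend.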
yuekaizhang commented 4 months ago

@csukuangfj Would you mind reviewing it when you are free? Thanks.