GPU inference on an A10 is no faster than CPU; unclear where the problem is

Notice: In order to resolve issues more efficiently, please raise issues following the template and fill in the details.

❓ Questions and Help

Before asking:

search the issues.
search the docs.

What is your question?

Following https://github.com/modelscope/FunASR/blob/e8f535f53320780cd8ed6f3b8588b187935d3ae5/runtime/onnxruntime/readme.md, I built the onnxruntime binaries with GPU=ON enabled.

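With quantization enabled, the best speedup I get is only about 300x, which is very close to the CPU build. GPU utilization does reach about 70%, so the GPU is being used. Why is the GPU build not faster?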
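For anyone comparing numbers, here is a minimal sketch of how the speedup figure is read in this report. It assumes speedup means total audio duration divided by total processing time, i.e. the inverse of the real-time factor (RTF). The function name and the durations below are hypothetical and are not taken from the benchmark output.

```python
def rtf_and_speedup(audio_seconds: float, processing_seconds: float) -> tuple[float, float]:
    """Return (rtf, speedup) for one benchmark run.

    Assumption: rtf = processing time / audio duration,
    and speedup = audio duration / processing time.
    """
    if audio_seconds <= 0 or processing_seconds <= 0:
        raise ValueError("durations must be positive")
    rtf = processing_seconds / audio_seconds
    speedup = audio_seconds / processing_seconds
    return rtf, speedup


# Hypothetical durations, chosen only to illustrate a speedup of about 300x.
rtf, speedup = rtf_and_speedup(audio_seconds=3600.0, processing_seconds=12.0)
print(f"RTF={rtf:.5f}, speedup={speedup:.1f}x")
```

Under this reading, a GPU run and a CPU run that both report about 300x took roughly the same wall-clock time on the same test set.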

Code

Build command:

cmake -DCMAKE_BUILD_TYPE=release .. -DONNXRUNTIME_DIR=/home/ubuntu/github/FunASR/onnxruntime-linux-x64-1.14.0 -DFFMPEG_DIR=/home/ubuntu/github/FunASR/ffmpeg-master-latest-linux64-gpl-shared -DGPU=on

Model export command:

funasr-export ++model=damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch ++quantize=true ++device=cuda ++type=torchscript

Inference command:

funasr-onnx-offline-rtf --model-dir /home/ubuntu/.cache/modelscope/hub/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch --vad-dir /home/ubuntu/.cache/modelscope/hub/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch --punc-dir /home/ubuntu/.cache/modelscope/hub/damo/punc_ct-transformer_cn-en-common-vocab471067-large --gpu --thread-num 20 --batch-size 48 --quantize true --wav-path ./test100.scp

What have you tried?

What's your environment?

OS (e.g., Linux):
FunASR Version (e.g., 1.0.0):
ModelScope Version (e.g., 1.11.0):
PyTorch Version (e.g., 2.0.0):
How you installed funasr (pip, source):
Python version:
GPU (e.g., V100M32): A10
CUDA/cuDNN version (e.g., cuda11.7):
Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1)
Any other relevant information:
Python 3.8; funasr and modelscope are both the latest versions.