What does your client setup look like?
I tested single queries with the Triton HTTP inference protocol, and also with a third-party load-testing tool (Locust) using 1 to 10 closed-loop clients (each client sends its next query only after receiving the previous response).
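For illustration, a minimal closed-loop Locust client against Triton's HTTP/REST (KServe v2) inference endpoint might look like the sketch below; the model name, input names, sequence length, and dummy token values are placeholders, not the actual deployment.

```python
# locustfile.py -- one closed-loop user: with no wait_time defined, each user
# sends the next request only after the previous response arrives.
from locust import HttpUser, task

MODEL_NAME = "bert_base"   # placeholder: adjust to the deployed model name
SEQ_LEN = 128              # placeholder sequence length

class TritonBertUser(HttpUser):
    @task
    def infer(self):
        payload = {
            "inputs": [
                {
                    "name": "INPUT__0",          # token ids (assumed name)
                    "shape": [1, SEQ_LEN],
                    "datatype": "INT64",
                    "data": [101] * SEQ_LEN,     # dummy token ids
                },
                {
                    "name": "INPUT__1",          # attention mask (assumed name)
                    "shape": [1, SEQ_LEN],
                    "datatype": "INT64",
                    "data": [1] * SEQ_LEN,
                },
            ]
        }
        # KServe v2 HTTP inference endpoint exposed by Triton.
        self.client.post(f"/v2/models/{MODEL_NAME}/infer", json=payload)
```

Run with, for example, `locust -f locustfile.py --host http://localhost:8000 --headless -u 10 -r 10` to simulate 10 closed-loop clients.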
Description: I deployed a bert-base model from Hugging Face's Transformers library via TorchScript and Triton's PyTorch backend, but I found that GPU utilization is around 0 and performance is far below the local test.
My Triton configuration and performance-test environment are valid: I tested other PyTorch-backend models and observed the expected performance and GPU utilization.
Triton Information: Triton container 23.01 from NGC.
To Reproduce
I obtained the model from the Transformers library and converted it to TorchScript as shown below (model file). I then deployed the model to Triton over HTTP with the following model config (model config).
In a rough test, latency is about 700 ms and throughput is around 1.5 RPS (requests per second), with GPU utilization around 0. A local test of the same .pt file (TorchScript model) gives about 8 ms latency and roughly 100 RPS with about 30% GPU utilization.
Through further observation, I found that the model does in fact use the GPU, but for some reason the end-to-end Triton latency is very high (about 750 ms) while the model's compute time is very low (about 2 ms), so the average GPU utilization ends up low. What causes the gap between Triton latency and compute time? I tested the TorchScript, ONNX, and TensorRT backends and saw the same performance.
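One way to narrow this down is to compare the client-side end-to-end latency with Triton's own per-model statistics, which break the server-side time into queue, compute-input, compute-infer, and compute-output components. A minimal sketch using the Python tritonclient package, assuming the same placeholder model and input names as above:

```python
import time
import numpy as np
import tritonclient.http as httpclient

MODEL_NAME = "bert_base"   # placeholder model name
SEQ_LEN = 128              # placeholder sequence length

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build dummy inputs matching the traced model's signature (assumed INT64).
ids = httpclient.InferInput("INPUT__0", [1, SEQ_LEN], "INT64")
ids.set_data_from_numpy(np.full((1, SEQ_LEN), 101, dtype=np.int64))
mask = httpclient.InferInput("INPUT__1", [1, SEQ_LEN], "INT64")
mask.set_data_from_numpy(np.ones((1, SEQ_LEN), dtype=np.int64))

# Client-side end-to-end latency for a single request.
start = time.perf_counter()
client.infer(MODEL_NAME, [ids, mask])
print(f"client-side latency: {(time.perf_counter() - start) * 1e3:.1f} ms")

# Server-side breakdown: queue / compute_input / compute_infer / compute_output.
stats = client.get_inference_statistics(model_name=MODEL_NAME)
print(stats)
```

If compute-infer is only a few milliseconds but the client-side number is hundreds of milliseconds, the gap is in queuing, input/output handling, or network/serialization rather than in the model itself.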
Execution log:
The model file:
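The original conversion script is not reproduced here; as a rough illustration only, tracing a bert-base model to TorchScript typically looks like the sketch below (the checkpoint name, sequence length, and output path are assumptions, not the actual values used).

```python
import os
import torch
from transformers import BertModel, BertTokenizer

# Load bert-base in eval mode with TorchScript-friendly settings.
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
model.eval()

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(
    "example input for tracing",
    padding="max_length",
    max_length=128,           # illustrative sequence length
    return_tensors="pt",
)

# Trace with (input_ids, attention_mask) as positional example inputs.
with torch.no_grad():
    traced = torch.jit.trace(model, (enc["input_ids"], enc["attention_mask"]))

# Save into a Triton model repository layout: <repo>/<model>/1/model.pt
os.makedirs("model_repository/bert_base/1", exist_ok=True)
traced.save("model_repository/bert_base/1/model.pt")
```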
The model config:
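The original config is likewise not reproduced; for illustration, a minimal config.pbtxt for a traced BERT model on the PyTorch (libtorch) backend might look like the following, assuming two INT64 inputs and one FP32 output (names, dims, and max_batch_size must match the actual traced model).

```
name: "bert_base"
platform: "pytorch_libtorch"
max_batch_size: 8
input [
  {
    name: "INPUT__0"        # token ids
    data_type: TYPE_INT64
    dims: [ 128 ]
  },
  {
    name: "INPUT__1"        # attention mask
    data_type: TYPE_INT64
    dims: [ 128 ]
  }
]
output [
  {
    name: "OUTPUT__0"       # last hidden state
    data_type: TYPE_FP32
    dims: [ 128, 768 ]
  }
]
instance_group [ { kind: KIND_GPU } ]
```

Note that on the libtorch backend the `<name>__<index>` naming convention ties each input/output to its positional order in the traced model's signature.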
Expected behavior: Can you help me check the process? I want to know why the model doesn't make good use of the GPU and how I can fix this. Thanks a lot!