Hi @jike-algorithm-zhangxiao,
On the trtexec side:
it just cost 1+ms to infer
I believe you're only looking at the GPU compute time from the trtexec results, but not the other latencies involved in the end-to-end process. I think this line may be a bit more of an apples-to-apples comparison:
[12/15/2022-12:04:27] [I] Average on 10 runs - GPU latency: 1.98892 ms - Host latency: 6.94839 ms (enqueue 1.55723 ms)
which looks to be about ~9-11 ms (not sure if enqueue is included in Host or not).
On the tritonserver side:
It cost 13+ ms to exec a decoder inference
import logging
import time

t = time.time()
d_rep = d_req.exec()  # BLS call into the TensorRT decoder model
logging.error((time.time() - t) * 1000)  # elapsed wall time in ms
Off the top of my head I don't know exactly what else might be involved in the BLS pipeline other than model execution. I'm not sure if any extra copies/gathers would be involved if keeping memory in-GPU. @Tabrizian could you comment?
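One way to sanity-check whether extra copies happen is to log the placement of the inputs inside the Python model; is_cpu() reports whether Triton delivered a tensor in host memory. This is only a sketch, and the input name INPUT_IDS is a placeholder:

import logging

import triton_python_backend_utils as pb_utils

def log_input_placement(request):
    # "INPUT_IDS" is a placeholder; use the input name from your decoder's config.pbtxt
    in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT_IDS")
    # is_cpu() is False when the tensor arrived in device memory, i.e. no host copy was made
    logging.error("INPUT_IDS is_cpu=%s", in_tensor.is_cpu())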
Hi @rmccorm4, in the log
[12/15/2022-12:04:27] [I] === Performance summary ===
[12/15/2022-12:04:27] [I] Throughput: 204.047 qps
[12/15/2022-12:04:27] [I] Latency: min = 6.80411 ms, max = 7.08138 ms, mean = 6.92449 ms, median = 6.91629 ms, percentile(90%) = 6.95837 ms, percentile(95%) = 7.01323 ms, percentile(99%) = 7.08138 ms
[12/15/2022-12:04:27] [I] Enqueue Time: min = 1.2683 ms, max = 2.39101 ms, mean = 1.40606 ms, median = 1.377 ms, percentile(90%) = 1.48555 ms, percentile(95%) = 1.55728 ms, percentile(99%) = 2.39101 ms
[12/15/2022-12:04:27] [I] H2D Latency: min = 4.79916 ms, max = 5.00237 ms, mean = 4.85155 ms, median = 4.84588 ms, percentile(90%) = 4.87872 ms, percentile(95%) = 4.94078 ms, percentile(99%) = 5.00237 ms
[12/15/2022-12:04:27] [I] GPU Compute Time: min = 1.87802 ms, max = 2.02445 ms, mean = 1.98437 ms, median = 1.98758 ms, percentile(90%) = 1.99373 ms, percentile(95%) = 1.99884 ms, percentile(99%) = 2.02445 ms
[12/15/2022-12:04:27] [I] D2H Latency: min = 0.0473633 ms, max = 0.0989113 ms, mean = 0.0885731 ms, median = 0.0892487 ms, percentile(90%) = 0.0930176 ms, percentile(95%) = 0.0947533 ms, percentile(99%) = 0.0989113 ms
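(For reference, the per-stage means roughly add up to the mean end-to-end Latency reported above: 4.85 ms H2D + 1.98 ms GPU Compute + 0.09 ms D2H ≈ 6.92 ms, so the host-to-device copy dominates the gap between GPU compute time and host latency.)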
we can see the ~9-11 ms includes the H2D latency, which costs quite a lot, about ~5-6 ms. I think the H2D latency is the data transfer between host and device, but on the tritonserver side the cache is already in device memory (because I set
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: { string_value: "no" }
}
in the python-model config.pbtxt and use dlpack). So I think in the python-backend module the time cost should be just the device compute time?
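For context, the zero-copy BLS path described here looks roughly like the sketch below; the model name decoder_trt and the tensor names tokens / kv_cache / logits are placeholders, not the actual names from this setup:

import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

import triton_python_backend_utils as pb_utils

def run_decoder(tokens_gpu, kv_cache_gpu):
    # Wrap the CUDA tensors via DLPack so they stay in device memory
    inputs = [
        pb_utils.Tensor.from_dlpack("tokens", to_dlpack(tokens_gpu)),
        pb_utils.Tensor.from_dlpack("kv_cache", to_dlpack(kv_cache_gpu)),
    ]
    d_req = pb_utils.InferenceRequest(
        model_name="decoder_trt",
        requested_output_names=["logits"],
        inputs=inputs,
    )
    d_rep = d_req.exec()
    if d_rep.has_error():
        raise pb_utils.TritonModelException(d_rep.error().message())
    # The output comes back via DLPack as well, without a host round-trip
    logits = pb_utils.get_output_tensor_by_name(d_rep, "logits")
    return from_dlpack(logits.to_dlpack())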
Can you try running perf analyzer on the TRT model directly and share the output? Is 13ms observed in all the inferences or only the first inference? There could be an initial warmup time associated with the first few inferences which could be skewing your results.
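One quick way to check for warmup skew on the python-backend side is to time a series of BLS calls and discard the first few. This is only a sketch; d_req is assumed to be the pb_utils.InferenceRequest already built for the decoder, as in the snippet above:

import time

latencies_ms = []
for _ in range(20):
    t = time.time()
    d_rep = d_req.exec()  # same BLS call as above
    latencies_ms.append((time.time() - t) * 1000)

warm = latencies_ms[5:]  # drop the first 5 iterations as warmup
print("warm mean %.2f ms, min %.2f ms" % (sum(warm) / len(warm), min(warm)))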
Oh, I just don't use the kv-cache, since that is faster. Thanks for your reply.
How can I deploy the Whisper model with Triton?
I built it up myself, layer by layer, using the API.
Which API did you use?
I want to use Triton + TensorRT to deploy Whisper, a transformer-like ASR model.
I want to use a kv-cache to accelerate inference, so I use the python-backend and dlpack to do this. When I build the decoder with TensorRT and use trtexec to measure the decoder performance as below,
it just cost 1+ms to infer. But when I use the python-backend to infer as below:
It cost 13+ ms to exec a decoder inference. I think dlpack is zero-copy, so what is the extra cost of python-backend inference, and how do I fix my python-backend code to reach ~1 ms inference?