triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

tutorials/Quick_Deploy/vLLM: is the Triton metrics inference delay result abnormal? #6448

Open activezhao opened 11 months ago

activezhao commented 11 months ago

I use tutorials/Quick_Deploy/vLLM to deploy CodeLlama 7B, then I call the metrics API, and part of the metrics output is:

nv_inference_request_summary_us_count{model="triton-vllm-code-llama-model",version="1"} 2639
nv_inference_request_summary_us_sum{model="triton-vllm-code-llama-model",version="1"} 2885180

I use Grafana to display the Triton metrics with the following calculation formula, and the inference delay it reports is 400 μs:

sum(rate(nv_inference_request_summary_us_sum)) / sum(rate(nv_inference_request_summary_us_count))

But the real average inference delay is 300-500 ms, so does the Triton metrics API work with tutorials/Quick_Deploy/vLLM? Or is there something wrong with the way I'm using it?
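For reference, here is roughly how I read the numbers outside of Grafana (a minimal sketch, assuming the metrics endpoint is reachable at localhost:8002 and using the counter names shown above; the parsing is deliberately simplified):

```python
import re

import requests  # assumption: requests is installed

METRICS_URL = "http://localhost:8002/metrics"  # assumption: default metrics port
MODEL = "triton-vllm-code-llama-model"


def read_counter(text: str, name: str) -> float:
    """Pull the value of one counter line, e.g.
    nv_inference_request_summary_us_sum{model="...",version="1"} 2885180
    """
    label = f'{name}{{model="{MODEL}",version="1"}}'
    match = re.search(re.escape(label) + r"\s+([0-9.eE+-]+)", text)
    return float(match.group(1)) if match else 0.0


body = requests.get(METRICS_URL).text
total_us = read_counter(body, "nv_inference_request_summary_us_sum")
count = read_counter(body, "nv_inference_request_summary_us_count")

# The summary metric is reported in microseconds, so divide by 1000 for ms.
avg_ms = total_us / count / 1000.0 if count else 0.0
print(f"avg latency from metrics: {avg_ms:.2f} ms over {int(count)} requests")
```

This uses the cumulative counters directly, so it is the lifetime average rather than a rate window, but it should land in the same ballpark as the Grafana panel.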

jbkyang-nvi commented 11 months ago

How did you calculate the average inference delay? Is it from the client side? cc: @rmccorm4 about metrics calculations

activezhao commented 11 months ago

> How did you calculate the average inference delay? Is it from the client side? cc: @rmccorm4 about metrics calculations

I just collect the metrics data by calling the ":8002/metrics" API, the data is saved to Prometheus, and then I use Grafana to display it. The calculation formula is:

sum(rate(nv_inference_request_summary_us_sum)) / sum(rate(nv_inference_request_summary_us_count))
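Concretely, the panel evaluates something like this (a sketch against the Prometheus HTTP API, assuming Prometheus runs at localhost:9090 and using an explicit 5m rate window; the trailing division by 1000 is only there to read the result in ms instead of μs):

```python
import requests  # assumption: requests is installed

PROMETHEUS_URL = "http://localhost:9090"  # assumption: local Prometheus instance

# Same formula as the Grafana panel, with an explicit 5m rate window.
# The metric is in microseconds, so the trailing /1000 converts to ms.
QUERY = (
    "sum(rate(nv_inference_request_summary_us_sum[5m])) "
    "/ sum(rate(nv_inference_request_summary_us_count[5m])) / 1000"
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
result = resp.json()["data"]["result"]
if result:
    # Instant vector result: value is a [timestamp, "value"] pair.
    print(f"avg latency over last 5m: {float(result[0]['value'][1]):.2f} ms")
else:
    print("no samples in the window yet")
```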

However, as I said, the latency looks abnormal: the real average latency is about 300-500 ms, but I got 400 μs.

In fact, I used FasterTransformer + Triton Server in the past, and the latency measured the same way was normal. So I do not know why this result is abnormal.

jbkyang-nvi commented 11 months ago

I meant: how did you get 300-500 ms? But I agree, the metrics should behave the same way as they would with the FasterTransformer backend.

activezhao commented 11 months ago

> I meant: how did you get 300-500 ms? But I agree, the metrics should behave the same way as they would with the FasterTransformer backend.

Let me describe it in more detail. I calculate the time difference before and after calling the Triton Server infer interface. It may not be precise, but it should not be far off.

In the code completion scenario, if the "stop" field is set to "\n", a request usually takes about 400 ms. If "stop" is not used and "max_new_tokens" is set to 100, it usually takes about 2 seconds, so it definitely can't be 400 μs.
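For completeness, the client-side number is just a wall-clock timer around the request (a rough sketch with tritonclient's gRPC streaming API, since the vLLM model streams responses; the input names text_input and stream follow the tutorial's model config, so adjust them, and add sampling_parameters, if your setup differs):

```python
import queue
import time

import numpy as np
import tritonclient.grpc as grpcclient

MODEL = "triton-vllm-code-llama-model"
responses = queue.Queue()


def callback(result, error):
    # Record the arrival time of every streamed response (or error).
    responses.put((time.perf_counter(), error if error else result))


client = grpcclient.InferenceServerClient(url="localhost:8001")  # assumption: default gRPC port

# Inputs follow the tutorial's vLLM model config (hypothetical prompt).
text_input = grpcclient.InferInput("text_input", [1], "BYTES")
text_input.set_data_from_numpy(np.array(["def fibonacci(n):"], dtype=object))
stream_flag = grpcclient.InferInput("stream", [1], "BOOL")
stream_flag.set_data_from_numpy(np.array([False]))

start = time.perf_counter()
client.start_stream(callback=callback)
client.async_stream_infer(model_name=MODEL, inputs=[text_input, stream_flag])
client.stop_stream()  # wait until all responses for this request arrive

# Latency = time from sending the request until the last response arrived.
last_ts = start
while not responses.empty():
    last_ts, _ = responses.get()
print(f"client-side latency: {(last_ts - start) * 1000:.1f} ms")
```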

In fact, I also want to know whether you have tested the metrics data and whether it is normal.

Thank u.