triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

High Queue Latency With BLS #7622

Open SandraWang-SH opened 2 months ago

SandraWang-SH commented 2 months ago

Description of problem: We are aiming to adopt BLS in Triton Server. After adding BLS code to the model, we found that the model's latency increased significantly.

In Datadog, it can be seen that the latency of the ensemble is much lower than that of BLS. Their server resource allocation and traffic are, of course, the same.

Avg: call watch tower with ensemble: 6.30 ms; call watch tower with BLS: 32.04 ms


After further monitoring, we found that as traffic increases, queue latency also increases.


Do you know why BLS causes the queueing? The performance of BLS seems to be much worse than that of the ensemble. Any ideas about this?

SandraWang-SH commented 2 months ago

@Tabrizian Hi, sorry to bother you. Could you please take some time to look at this issue? Many thanks, I look forward to your reply.

MatthieuToulemont commented 1 month ago

I think the issue is that BLS does not benefit from ensemble scheduling.

Let's say you have a pipeline with three steps A, B, C and four requests (R1, R2, R3, R4) in the queue.

In the BLS case, R2 will only start to be processed once R1 has gone through all three steps A, B, and C.

In the ensemble case, as soon as R1 moves from step A to step B, R2 starts to be processed in step A. As you can imagine, this is much more efficient, as requests spend far less time in the queue.

I would advise using ensembles as much as you can, keeping in mind they don't allow for control flow between models. Below you will find a rough simulation of what would happen for a three-step model.

On top of the scheduling difference, I have observed that using Python and torch in your BLS model adds significant overhead to the pipeline.

[Screenshots: simulated per-request latency for a three-step pipeline under ensemble scheduling vs. BLS]

XiaoxueWang1 commented 1 month ago

Hi @MatthieuToulemont, thanks for your reply. Your explanation makes sense. Has Triton considered optimizing BLS to achieve performance similar to ensembles? In fact, we have three steps A, B, and C. We want to add Cal Log (a logging platform that records transactions and events) to record the failure of each step. With BLS, appropriate logs can be printed when any step fails. With an ensemble, we only get an internal error and have no way of knowing which step failed. BLS is a very good fit for this Cal Log scenario, but the increase in latency prevents us from using it. A rough sketch of what we have in mind is below.
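The step model names (step_a, step_b, step_c), the tensor names, and the log_to_cal helper below are placeholders rather than our real configuration; only the pb_utils calls are actual Triton Python-backend APIs.

```python
import triton_python_backend_utils as pb_utils


def log_to_cal(message):
    """Placeholder for the real Cal Log client call."""
    print(message)


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Assume every step model takes a tensor named "INPUT" and
            # returns a tensor named "OUTPUT".
            data = pb_utils.get_input_tensor_by_name(request, "INPUT").as_numpy()
            failed = False
            for step in ("step_a", "step_b", "step_c"):
                step_request = pb_utils.InferenceRequest(
                    model_name=step,
                    inputs=[pb_utils.Tensor("INPUT", data)],
                    requested_output_names=["OUTPUT"],
                )
                step_response = step_request.exec()
                if step_response.has_error():
                    # This is what the ensemble cannot give us: the failing
                    # step is known here, so it can be logged individually.
                    log_to_cal(f"{step} failed: {step_response.error().message()}")
                    responses.append(pb_utils.InferenceResponse(
                        error=pb_utils.TritonError(f"{step} failed")))
                    failed = True
                    break
                # Feed this step's output into the next step.
                data = pb_utils.get_output_tensor_by_name(
                    step_response, "OUTPUT").as_numpy()
            if not failed:
                responses.append(pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("OUTPUT", data)]))
        return responses
```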

MatthieuToulemont commented 1 month ago

Has Triton considered optimizing BLS to achieve similar performance to ensemble?

I have no idea (I don't work at nvidia :) )

If ensemble is used, there is an internal error and we have no way of knowing which step failed.

Depending on the verbosity you set for Triton, you should be able to see in which step an error occurs. Which logging level do you use?
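If it helps: verbose logging is enabled when starting the server, e.g. `tritonserver --log-verbose=1` (my assumption is that level 1 is already enough; higher values add more detail), and the per-model request handling it prints should show which composing model the error comes from.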