triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Triton replication on Kubernetes, all traffic forwarded to the same pod #6177

Open Vincouux opened 1 year ago

Vincouux commented 1 year ago

Description I deployed Triton Inference Server on Kubernetes (GKE). To balance the load, I created a LoadBalancer Service. As a client, I'm using the Python HTTP client. I was expecting all the (inference) requests to be distributed across the replicas, but it looks like all the requests are going to the same pod. Restarting a client gets it mapped randomly to a pod. It looks like the sessions are "sticky". I made sure that this is not the case in the Service configuration. Looking at the documentation, there is no keep-alive setting on the HTTP client. Is the HTTP client keeping the connections alive? Is this intended? Can we disable it? In terms of scalability, it's not great, as the work is not evenly distributed.

Triton Information tritonserver:23.05-py3

Are you using the Triton container or did you build it yourself? Triton container

To Reproduce
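A minimal sketch of the kind of client loop that shows the behavior described above (the server URL, model name, and input shape are placeholders, not taken from the original setup):

```python
import numpy as np
import tritonclient.http as httpclient

# Placeholder address and model name.
client = httpclient.InferenceServerClient(url="triton-lb.example.com:8000")

# Random input just to drive traffic; shape/dtype must match the model.
inp = httpclient.InferInput("INPUT0", [1, 3, 224, 224], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))

# The single client object (and its underlying geventhttpclient connection)
# is reused for every request, so all traffic from this process stays on
# one pod behind the load balancer.
for _ in range(1000):
    client.infer(model_name="my_model", inputs=[inp])
```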

Expected behavior I expected the traffic to be distributed evenly so that inference load is balanced across the replicas.

Screenshot (2023-08-13): Grafana panel showing inferences per second for each of the two Triton replicas. Clients randomly stick to one replica, and each client does not produce the same number of inferences per second, so the work is not evenly spread.
Vincouux commented 1 year ago

Starting to answer some of my own questions in case someone else runs into the same issue. It seems like the Python client uses geventhttpclient to communicate with the server. Based on their documentation, this client uses persistent connections: "geventhttpclient has been specifically designed for high concurrency, streaming and support HTTP 1.1 persistent connections." That explains why the sessions are "sticky".

However, this HTTP client exposes some settings (connection_timeout, network_timeout, concurrency) which should allow the load to be distributed correctly as connections get closed over time. But that doesn't seem to be the case.
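For reference, a sketch of how those settings are passed through the Triton Python HTTP client (the address and values are illustrative, not a recommendation):

```python
import tritonclient.http as httpclient

# These settings are forwarded to the underlying geventhttpclient pool.
# The timeouts bound individual connects/reads; they do not appear to
# force an idle persistent connection to be torn down, so the session
# stays pinned to one pod.
client = httpclient.InferenceServerClient(
    url="triton-lb.example.com:8000",  # placeholder address
    concurrency=4,            # max connections in the pool
    connection_timeout=5.0,   # seconds to establish a connection
    network_timeout=5.0,      # seconds for a read on an open connection
)
```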

Am I missing something?

Vincouux commented 1 year ago

After further investigation, it sounds like geventhttpclient lazily manages a pool of connections, meaning it only creates a new connection when none is available. Therefore, if you use Triton synchronously without concurrency (one request at a time), the client will keep reusing the same connection.

This makes it hard to properly load balance the workload when you have roughly the same number of clients and server replicas.

For now, my solution is to close all the connections (only one in my case, but it could be any number) after each request. This way, load balancing happens when each request is made instead of only once when the connection is first established.
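One way to express that workaround with the Triton Python HTTP client (names are placeholders); the trade-off is a new TCP handshake, and possibly TLS handshake, per request:

```python
import tritonclient.http as httpclient

URL = "triton-lb.example.com:8000"  # placeholder address

def infer_once(inputs):
    """Open a fresh connection, run one inference, then close it.

    Creating and closing the client per request means every request goes
    through connection establishment again, so the load balancer gets a
    chance to pick a different pod each time.
    """
    client = httpclient.InferenceServerClient(url=URL)
    try:
        return client.infer(model_name="my_model", inputs=inputs)  # placeholder model
    finally:
        client.close()
```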

bashirmindee commented 3 months ago

We are having the same problem with Triton + Kubernetes, but on the gRPC side. It seems that Triton isn't designed to work with a dynamic number of pods on the server side. I think we need a proper solution on the Triton side directly, something better than recreating clients or closing and reopening the connection for each request.

victornguen commented 2 months ago

@bashirmindee Hi! Do you use any service mesh that can load balance gRPC traffic? If not, you should make your Service headless in Kubernetes and implement load balancing on the client side.
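A rough sketch of what that client-side approach could look like with the Python gRPC client, assuming a headless Service (clusterIP: None); the service name, port, model name, and round-robin policy are all illustrative:

```python
import itertools
import socket

import tritonclient.grpc as grpcclient

# Placeholder headless-service DNS name; with clusterIP: None, this name
# resolves to the individual pod IPs instead of a single virtual IP.
HEADLESS_SERVICE = "triton.default.svc.cluster.local"
GRPC_PORT = 8001

def make_clients():
    """Resolve the headless service and open one gRPC client per pod."""
    infos = socket.getaddrinfo(HEADLESS_SERVICE, GRPC_PORT, proto=socket.IPPROTO_TCP)
    pod_ips = sorted({info[4][0] for info in infos})
    return [grpcclient.InferenceServerClient(url=f"{ip}:{GRPC_PORT}") for ip in pod_ips]

# Naive round-robin over the resolved pods; re-resolving periodically (or
# watching the Endpoints object) would be needed to follow pod churn.
clients = itertools.cycle(make_clients())

def infer(inputs):
    return next(clients).infer(model_name="my_model", inputs=inputs)  # placeholder model
```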

bashirmindee commented 2 months ago

Thank you for this. We tried the Istio service mesh, but the results weren't as good as we hoped.