Vincouux opened this issue 1 year ago
Starting to answer some of my own questions in case someone has the same issue. It seems the Python client uses geventhttpclient to communicate with the server. Based on its documentation, this client relies on persistent connections: "geventhttpclient has been specifically designed for high concurrency, streaming and support HTTP 1.1 persistent connections."
That explains why the sessions are "sticky".
However, this HTTP client exposes some settings (connection_timeout, network_timeout, concurrency) that should, in principle, distribute the load correctly, since connections would be closed over time. But that doesn't seem to happen in practice.
Am I missing something?
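For reference, these are the settings I mean, as they are passed through the Python client constructor. A minimal sketch, assuming the tritonclient HTTP wrapper; the URL is a placeholder:

```python
import tritonclient.http as httpclient

# These keyword arguments are forwarded to the underlying geventhttpclient.
client = httpclient.InferenceServerClient(
    url="triton-lb.example.com:8000",  # placeholder for the LoadBalancer address
    concurrency=1,            # size of the geventhttpclient connection pool
    connection_timeout=60.0,  # seconds allowed to establish a connection
    network_timeout=60.0,     # seconds allowed for socket reads/writes
)
```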
After further investigation, it sounds like geventhttpclient manages its connection pool lazily: it only creates a new connection when none is available. Therefore, if you use Triton synchronously without concurrency (one request at a time), the client will keep reusing the same connection.
This makes it hard to load balance the workload properly when you have roughly as many clients as servers.
For now, my solution is to close all the connections (only one in my case, but it could be any number) after each request. This way, load balancing happens per request instead of only once up front.
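A rough sketch of that workaround: instead of keeping one long-lived client, open a fresh client per request and close it afterwards, which is equivalent to dropping all pooled connections after each request. The URL, model name, and tensor names below are placeholders:

```python
import numpy as np
import tritonclient.http as httpclient

def infer_once(url, model_name, batch):
    # A fresh client means a fresh TCP connection, so the LoadBalancer
    # gets a chance to pick a different pod for every request.
    client = httpclient.InferenceServerClient(url=url, concurrency=1)
    try:
        inp = httpclient.InferInput("INPUT__0", list(batch.shape), "FP32")
        inp.set_data_from_numpy(batch)
        result = client.infer(model_name=model_name, inputs=[inp])
        return result.as_numpy("OUTPUT__0")
    finally:
        client.close()  # drop the pooled connection(s) after the request

out = infer_once("triton-lb.example.com:8000", "my_model",
                 np.zeros((1, 3), dtype=np.float32))
```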
We are having the same problem with Triton + Kubernetes, but on the gRPC side. It seems that Triton isn't designed to work with a dynamic number of pods on the server side. I think there should be a proper solution on the Triton side directly, rather than recreating clients or closing and reopening the connection for each request.
@bashirmindee Hi! Do you use any service mesh that can load balance gRPC traffic? If not, you should make your service headless in k8s and implement load balancing on the client side.
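For what it's worth, a rough sketch of client-side balancing against a headless Service with the Python gRPC client. The Service hostname, port, and model name are placeholders, and periodic DNS re-resolution and error handling are omitted:

```python
import itertools
import socket

import tritonclient.grpc as grpcclient

HEADLESS_HOST = "triton-headless.default.svc.cluster.local"  # placeholder
GRPC_PORT = 8001

def discover_pods(host, port):
    # A headless Service (clusterIP: None) resolves to the individual pod IPs.
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

# One gRPC client per pod, used in round-robin order.
clients = [grpcclient.InferenceServerClient(url=f"{ip}:{GRPC_PORT}")
           for ip in discover_pods(HEADLESS_HOST, GRPC_PORT)]
round_robin = itertools.cycle(clients)

def balanced_infer(model_name, inputs):
    # A real implementation would re-resolve DNS periodically and
    # add/remove clients as pods come and go.
    return next(round_robin).infer(model_name=model_name, inputs=inputs)
```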
Thank you for this. We tried the Istio service mesh, but the results weren't as good as we hoped.
Description
I deployed Triton Inference Server on Kubernetes (GKE). To balance the load, I created a LoadBalancer Service. As a client, I'm using the Python HTTP client. I was expecting the (inference) requests to be distributed across the replicas, but it looks like all the requests are going to the same pod. Restarting a client gets it mapped randomly to a pod. It looks like the sessions are "sticky". I made sure that this is not the case in the Service configuration. Looking at the documentation, there is no keep-alive setting on the HTTP client. Is the HTTP client keeping connections alive? Is it intended? Can we disable it? In terms of scalability, this is not great, as the work is not evenly distributed.
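For illustration, a minimal sketch of the client side of this setup; the URL, model name, and tensor name are placeholders:

```python
import numpy as np
import tritonclient.http as httpclient

# Client pointed at the LoadBalancer Service address (placeholder URL).
client = httpclient.InferenceServerClient(url="triton-lb.example.com:8000")

data = np.zeros((1, 3), dtype=np.float32)
inp = httpclient.InferInput("INPUT__0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

# Every iteration reuses the same underlying connection,
# so all requests end up on the same pod.
for _ in range(100):
    client.infer(model_name="my_model", inputs=[inp])
```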
Triton Information
tritonserver:23.05-py3
Are you using the Triton container or did you build it yourself? Triton container
To Reproduce
Expected behavior
I expected the traffic to be evenly distributed across the replicas to balance the inference load.