pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

How to identify "full" torchserve instances on Google Kubernetes Engine #2412

Open tsteffek opened 1 year ago

tsteffek commented 1 year ago

We're currently trying to deploy TorchServe at scale on Kubernetes. Our request load fluctuates heavily: roughly every 5 minutes a batch of requests comes in with nothing in between, and sometimes there are huge spikes. We therefore want small pods that scale aggressively as soon as load arrives.

Here are the issues: on what metric can we scale, and is there a way to identify pods that are at their limit?

For scaling we currently just use CPU usage; queueLength would be ideal. For that we probably have to wait on #2101, right?
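For context, our current setup is just a plain CPU-based HorizontalPodAutoscaler, roughly like the sketch below (the Deployment name and all numbers are placeholders, not our real configuration):

```yaml
# Minimal sketch of CPU-based autoscaling; names and thresholds are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: torchserve
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: torchserve            # hypothetical Deployment running TorchServe
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out once average CPU passes 60%
```

The problem is that CPU only rises once workers are already busy, so it lags behind the actual backlog; that's why a queue-length metric would be the better scaling signal.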

Once scaling has happened, k8s has no way of knowing which pods can actually serve requests (one request can take up to 10 seconds, so a full queue will stay full for a while). Again, a readiness probe on queueLength would be ideal; queueTime only tells us that we should have scaled x seconds ago.

The workaround we've come up with is using the readinessProbe to send a dummy request to the handler and check whether it gets rejected immediately, roughly as sketched below. But that can't be it, right? Surely this problem isn't so unique that there is no better solution.
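To be concrete, the probe looks something like this. The model name, payload path, and timings are placeholders, and it assumes TorchServe's inference API (port 8080 by default) answers with an error status while the job queue is full, so a failed probe marks the pod NotReady until the backlog drains:

```yaml
# Sketch of the dummy-request readiness probe; "my_model" and the payload
# file are placeholders, and curl is assumed to be available in the image.
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - curl -sf --max-time 2 -d @/opt/probe/dummy_input.json http://localhost:8080/predictions/my_model
  initialDelaySeconds: 30
  periodSeconds: 5
  failureThreshold: 1
```

The obvious downside is that every probe consumes a real inference slot on an already busy pod, which is part of why this feels like the wrong tool.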

I apologize in advance if this is not the right place to ask this question, I couldn't find anything better.

cjidboon94 commented 1 year ago

@tsteffek Would KServe be an option for you? It integrates well with Kubernetes, uses Knative under the hood, and lets you scale on concurrency (queueLength per pod, as I'd interpret it) and a few other metrics. It can also aggressively scale pods down when necessary, or keep them up for a certain timeout to prevent constant cold starts. A rough sketch of what that could look like is below.
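As an illustration only, not a drop-in config: an InferenceService with Knative's concurrency-based autoscaling might look roughly like this, assuming the Knative autoscaling annotations are propagated to the underlying Knative service (the name, storageUri, and targets are placeholders):

```yaml
# Rough sketch; name, storageUri, and all numbers are placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
  annotations:
    autoscaling.knative.dev/metric: "concurrency"  # scale on in-flight requests per pod
    autoscaling.knative.dev/target: "4"            # desired concurrent requests per pod
spec:
  predictor:
    minReplicas: 0            # allow scale-to-zero between the 5-minute bursts
    maxReplicas: 20
    pytorch:
      storageUri: gs://my-bucket/my-model          # placeholder model archive location
```

With a concurrency target like this, pods that are already processing their share of in-flight requests stop receiving new ones, which addresses the "which pods are full" question without a custom readiness probe.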