pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

How to identify "full" torchserve instances on Google Kubernetes Engine #2412

Open tsteffek opened 1 year ago

tsteffek commented 1 year ago

We're currently trying to deploy TorchServe at scale on Kubernetes. Our request load fluctuates heavily: roughly every 5 minutes a batch of requests comes in with nothing in between, and sometimes there are huge spikes. We therefore want small pods that scale aggressively as soon as load arrives.

Here are the issues: on what metric can we scale, and is there a way to identify pods that are at their limit?

For scaling we currently just use CPU usage; queueLength would be ideal. For that we probably have to wait on #2101, right?
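For context, our current setup is just a plain CPU-based HorizontalPodAutoscaler, roughly like the sketch below (the Deployment name and all numbers are placeholders, not our real configuration):

```yaml
# Minimal sketch of CPU-based autoscaling; names and thresholds are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: torchserve
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: torchserve            # hypothetical Deployment running TorchServe
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out once average CPU passes 60%
```

The problem is that CPU only rises once workers are already busy, so it lags behind the actual backlog; that's why a queue-length metric would be the better scaling signal.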

Once scaling has happened, k8s has no way of knowing which pods can actually serve requests (one request can take up to 10 seconds, so a full queue will stay full for a while). Again, a readiness probe on queueLength would be ideal; queueTime only tells us that we should have scaled x seconds ago.

The workaround we've come up with is using the readinessProbe to send a dummy request to the handler and check whether it gets rejected immediately, roughly as sketched below. But that can't be it, right? Surely this problem isn't so unique that there is no better solution.
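To be concrete, the probe looks something like this. The model name, payload path, and timings are placeholders, and it assumes TorchServe's inference API (port 8080 by default) answers with an error status while the job queue is full, so a failed probe marks the pod NotReady until the backlog drains:

```yaml
# Sketch of the dummy-request readiness probe; "my_model" and the payload
# file are placeholders, and curl is assumed to be available in the image.
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - curl -sf --max-time 2 -d @/opt/probe/dummy_input.json http://localhost:8080/predictions/my_model
  initialDelaySeconds: 30
  periodSeconds: 5
  failureThreshold: 1
```

The obvious downside is that every probe consumes a real inference slot on an already busy pod, which is part of why this feels like the wrong tool.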

I apologize in advance if this is not the right place to ask this question, I couldn't find anything better.

cjidboon94 commented 1 year ago

@tsteffek Would KServe be an option for you? It integrates well with Kubernetes, uses Knative under the hood, and lets you scale on concurrency (queueLength per pod, as I'd interpret it) and a few other metrics. It can also aggressively scale pods down when necessary, or keep them up for a certain timeout to prevent constant cold starts. A rough sketch of what that could look like is below.
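As an illustration only, not a drop-in config: an InferenceService with Knative's concurrency-based autoscaling might look roughly like this, assuming the Knative autoscaling annotations are propagated to the underlying Knative service (the name, storageUri, and targets are placeholders):

```yaml
# Rough sketch; name, storageUri, and all numbers are placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
  annotations:
    autoscaling.knative.dev/metric: "concurrency"  # scale on in-flight requests per pod
    autoscaling.knative.dev/target: "4"            # desired concurrent requests per pod
spec:
  predictor:
    minReplicas: 0            # allow scale-to-zero between the 5-minute bursts
    maxReplicas: 20
    pytorch:
      storageUri: gs://my-bucket/my-model          # placeholder model archive location
```

With a concurrency target like this, pods that are already processing their share of in-flight requests stop receiving new ones, which addresses the "which pods are full" question without a custom readiness probe.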