Hi, I have some questions regarding model scaling.
We're currently running a single TorchServe instance on a Kubernetes cluster for a number of models whose loads vary throughout the day at different times. After some digging I've worked out that TorchServe has no built-in autoscaling feature (which makes the minWorkers and maxWorkers settings a bit misleading). It therefore seems our only option is horizontal scaling on Kubernetes, as documented at https://github.com/pytorch/serve/blob/master/kubernetes/autoscale.md. However, since our models carry varying degrees of load at any one time, we don't really want to scale them all together.
Are we doing something fundamentally wrong with our setup? Should we perhaps run one TorchServe instance per "group" of models?
If our setup isn't flawed, would it be worth creating a sidecar container or a separate application that monitors the queue time for each model and scales its workers up or down via the management API? A rough sketch of what I have in mind is below.
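This is only a minimal sketch of the idea, not a working implementation: it assumes TorchServe's default management (8081) and metrics (8082) ports, assumes `ts_queue_latency_microseconds` is the right Prometheus metric to watch, and the threshold and naive "+1 worker" policy are placeholders.

```python
"""Hypothetical sidecar: poll TorchServe's metrics endpoint for per-model
queue latency and scale workers via the management API's scale-workers call."""
import re
import time
import urllib.request

METRICS_URL = "http://localhost:8082/metrics"    # TorchServe default metrics port
MANAGEMENT_URL = "http://localhost:8081"         # TorchServe default management port
QUEUE_METRIC = "ts_queue_latency_microseconds"   # assumed metric name
SCALE_UP_US = 500_000                            # placeholder threshold: 0.5 s queue latency
POLL_SECONDS = 30


def queue_latency_by_model() -> dict[str, float]:
    """Parse Prometheus text output into {model_name: queue latency in us}."""
    text = urllib.request.urlopen(METRICS_URL).read().decode()
    # Lines look like: ts_queue_latency_microseconds{model_name="mymodel",...} 1234.0
    pattern = re.compile(
        rf'{QUEUE_METRIC}{{[^}}]*model_name="([^"]+)"[^}}]*}}\s+([0-9.eE+-]+)'
    )
    return {model: float(value) for model, value in pattern.findall(text)}


def scale_workers(model: str, min_worker: int) -> None:
    """Scale one model's workers with PUT /models/{name}?min_worker=N."""
    req = urllib.request.Request(
        f"{MANAGEMENT_URL}/models/{model}?min_worker={min_worker}&synchronous=true",
        method="PUT",
    )
    urllib.request.urlopen(req)


def main() -> None:
    workers: dict[str, int] = {}  # naive in-memory view of current worker counts
    while True:
        for model, latency_us in queue_latency_by_model().items():
            current = workers.get(model, 1)
            if latency_us > SCALE_UP_US:
                workers[model] = current + 1
                scale_workers(model, workers[model])
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    main()
```

In practice the latency metric is cumulative, so a real version would compute windowed rates, add cooldowns and a scale-down path, and cap at maxWorkers, but this is the shape of it.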
Thanks