substratusai / lingo

Lightweight ML model proxy and autoscaler for Kubernetes
https://www.substratus.ai
Apache License 2.0

Configurable concurrency per replica setting #12

Open samos123 opened 8 months ago

samos123 commented 8 months ago

Currently, Lingo seems to create more replicas quickly as requests come in while the pod isn't ready to serve yet. The number of requests a single pod can handle concurrently should be configurable.

This could be done with the following annotation on the deployment:

lingo.substratus.ai/concurrency: 100

In this case, Lingo should only scale up when a single pod is handling more than 100 HTTP requests in parallel. I think a good default value is 100, which is also what Knative uses: https://knative.dev/docs/serving/autoscaling/concurrency/#soft-versus-hard-concurrency-limits
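A minimal sketch of that scale-up rule, assuming replicas are computed as a ceiling division of in-flight requests by the per-replica limit (the function and variable names here are hypothetical, not Lingo's actual autoscaler code):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas returns the replica count needed so that no single pod
// handles more than concurrencyPerReplica requests in parallel.
// Hypothetical helper; Lingo's real autoscaler logic may differ.
func desiredReplicas(inFlight, concurrencyPerReplica int) int {
	if concurrencyPerReplica <= 0 {
		concurrencyPerReplica = 100 // assumed default, matching the Knative soft limit above
	}
	return int(math.Ceil(float64(inFlight) / float64(concurrencyPerReplica)))
}

func main() {
	fmt.Println(desiredReplicas(250, 100)) // 3: 250 in-flight requests need 3 pods at 100 each
}
```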

nstogner commented 8 months ago

The setting is here: https://github.com/substratusai/lingo/blob/4350d67aedd7871c8397f27d4c6ab6c2d79e4865/main.go#L78

In this case, 1 is the concurrency setting.
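For reference, a hedged sketch of how that hard-coded value could become a configurable global default, e.g. via a flag (the flag name, default, and wiring are assumptions, not the actual main.go):

```go
package main

import "flag"

// Hypothetical flag replacing the hard-coded concurrency value in main.go;
// the flag name and default are illustrative, not Lingo's actual CLI.
var concurrencyPerReplica = flag.Int("concurrency", 100,
	"default number of in-flight requests a single replica handles before scaling up")

func main() {
	flag.Parse()
	_ = *concurrencyPerReplica // would be passed into the queue/autoscaler setup
}
```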

samos123 commented 8 months ago

I have already changed the default from 1 to 100. What's left is making this configurable through the annotation and reconciling on it as needed.
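A possible shape for the per-deployment override, assuming the annotation key from above and a global default as the fallback (the helper name is illustrative):

```go
package main

import (
	"fmt"
	"strconv"

	appsv1 "k8s.io/api/apps/v1"
)

const concurrencyAnnotation = "lingo.substratus.ai/concurrency"

// concurrencyFor reads the per-deployment annotation and falls back to the
// global default when it is missing or invalid. Hypothetical helper.
func concurrencyFor(d *appsv1.Deployment, globalDefault int) int {
	if v, ok := d.GetAnnotations()[concurrencyAnnotation]; ok {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return globalDefault
}

func main() {
	d := &appsv1.Deployment{}
	d.SetAnnotations(map[string]string{concurrencyAnnotation: "200"})
	fmt.Println(concurrencyFor(d, 100)) // 200: annotation overrides the global default
}
```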

alpe commented 6 months ago

Do I understand this correctly: you suggest a new annotation on the model deployment so that, instead of a global value in Lingo's main, this can be customized at the model level? That sounds very reasonable to me.

The deployment manager receives updates on Reconcile and could trigger a queue resize on the instance.
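A sketch of what that queue resize could look like: a small limiter guarded by a mutex and condition variable, with a resize method the deployment manager calls from Reconcile when the annotation changes (all type and method names here are illustrative, not Lingo's actual internals):

```go
package main

import "sync"

// workQueue caps how many requests are processed in parallel for one model.
// Hypothetical sketch; Lingo's real queue implementation may differ.
type workQueue struct {
	mu       sync.Mutex
	cond     *sync.Cond
	limit    int // max in-flight requests per replica
	inFlight int
}

func newWorkQueue(limit int) *workQueue {
	q := &workQueue{limit: limit}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// resize is what Reconcile would call after reading a changed annotation.
func (q *workQueue) resize(limit int) {
	q.mu.Lock()
	q.limit = limit
	q.mu.Unlock()
	q.cond.Broadcast() // wake all waiters in case the limit grew
}

// acquire blocks until a request slot is free.
func (q *workQueue) acquire() {
	q.mu.Lock()
	for q.inFlight >= q.limit {
		q.cond.Wait()
	}
	q.inFlight++
	q.mu.Unlock()
}

// release frees a slot and wakes one waiter.
func (q *workQueue) release() {
	q.mu.Lock()
	q.inFlight--
	q.mu.Unlock()
	q.cond.Signal()
}

func main() {
	q := newWorkQueue(1)
	q.acquire()
	q.release()
	q.resize(100) // e.g. after Reconcile sees lingo.substratus.ai/concurrency: 100
}
```

Broadcasting on resize (rather than signaling a single goroutine) matters because growing the limit may unblock several waiters at once.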

samos123 commented 6 months ago

@alpe Yes, that's correct. There should be a global default value, and each deployment should be able to override it by setting an annotation.