Open samos123 opened 8 months ago
The setting is here: https://github.com/substratusai/lingo/blob/4350d67aedd7871c8397f27d4c6ab6c2d79e4865/main.go#L78
In this case, `1` is the concurrency setting.
I have already changed the default from 1 to 100. What's left is making this configurable through an annotation and reconciling on it as needed.
Do I understand this correctly: you suggest a new annotation on the model Deployment so that, instead of a global value in lingo's main, this can be customized per model? That sounds very reasonable to me.
The deployment manager receives updates on Reconcile and could trigger a queue resize on the instance.
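A queue that can be resized at runtime could be sketched roughly like the following. This is a minimal illustration, not lingo's actual implementation; the type and method names are made up for this example:

```go
package main

import (
	"fmt"
	"sync"
)

// ConcurrencyQueue is a hypothetical per-deployment request queue whose
// concurrency limit can be changed at runtime, e.g. from Reconcile when
// the annotation on the Deployment changes.
type ConcurrencyQueue struct {
	mu    sync.Mutex
	limit int
	inUse int
	cond  *sync.Cond
}

func NewConcurrencyQueue(limit int) *ConcurrencyQueue {
	q := &ConcurrencyQueue{limit: limit}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// Acquire blocks until a slot is free under the current limit.
func (q *ConcurrencyQueue) Acquire() {
	q.mu.Lock()
	defer q.mu.Unlock()
	for q.inUse >= q.limit {
		q.cond.Wait()
	}
	q.inUse++
}

// Release frees a slot and wakes any waiters.
func (q *ConcurrencyQueue) Release() {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.inUse--
	q.cond.Broadcast()
}

// Resize changes the limit; blocked waiters re-check against the new value.
func (q *ConcurrencyQueue) Resize(limit int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.limit = limit
	q.cond.Broadcast()
}

func main() {
	q := NewConcurrencyQueue(1)
	q.Acquire()
	q.Resize(2) // e.g. triggered from Reconcile after an annotation update
	q.Acquire() // succeeds under the new limit
	fmt.Println("acquired 2 slots with limit 2")
	q.Release()
	q.Release()
}
```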
@alpe Yes that's correct. There should be a default global value. In addition, each deployment should be able to override the default global value by setting an annotation.
Currently it seems Lingo quickly creates more replicas as requests come in while the pod isn't ready to serve yet. It should be configurable how many requests a single pod can handle concurrently.
This could be done by using the following annotation in the deployment:
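The annotation snippet from the original comment isn't reproduced here; a hypothetical sketch, assuming an annotation key like `lingo.substratus.ai/max-concurrent-requests` (the actual key may differ):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model
  annotations:
    lingo.substratus.ai/max-concurrent-requests: "100"
```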
In this case Lingo should only scale up when a single pod is handling more than 100 HTTP requests in parallel. I think 100 is a good default value, which is also what Knative uses: https://knative.dev/docs/serving/autoscaling/concurrency/#soft-versus-hard-concurrency-limits