substratusai / kubeai

Private Open AI on Kubernetes
https://www.kubeai.org
Apache License 2.0

Limit request queue to fail fast #50

Open alpe opened 10 months ago

alpe commented 10 months ago

Incoming requests are queued in memory until capacity on a serving backend becomes available. This can become critical in peak-load or DoS scenarios. Instead of leaving the queue unbounded, we should fail fast and reject new requests with StatusServiceUnavailable (503). The total queue limit could be dynamic and/or a fixed value (due to memory limitations).
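A minimal sketch of the fail-fast behavior, assuming a Go `net/http` proxy; the `boundedQueue` middleware, its limit, and the error message are illustrative, not Lingo's actual handler:

```go
package main

import "net/http"

// boundedQueue is a hypothetical middleware: a buffered channel acts as a
// counting semaphore over queued/in-flight requests.
type boundedQueue struct {
	slots chan struct{}
	next  http.Handler
}

func newBoundedQueue(limit int, next http.Handler) *boundedQueue {
	return &boundedQueue{slots: make(chan struct{}, limit), next: next}
}

func (q *boundedQueue) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	select {
	case q.slots <- struct{}{}: // a slot is free: admit and queue as usual
		defer func() { <-q.slots }()
		q.next.ServeHTTP(w, r)
	default: // queue is full: fail fast instead of buffering unboundedly
		http.Error(w, "request queue is full, retry later", http.StatusServiceUnavailable)
	}
}
```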

For a dynamic calculation: `factor * total_number_of_replicas * concurrent_requests_per_replica`. The factor should be defined in the context of the time required to scale up instances. I think I saw 10x somewhere in a similar project, but I cannot find the number now. It would be a good starting parameter to customize for different environments.
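The formula itself is a straightforward product; a hedged sketch (the function and parameter names are mine, not existing config):

```go
// Hypothetical dynamic limit, per the formula above.
func dynamicQueueLimit(factor, totalReplicas, concurrentPerReplica int) int {
	return factor * totalReplicas * concurrentPerReplica
}
```

For example, with factor=10, 4 replicas, and 8 concurrent requests per replica, the queue would admit up to 320 requests before returning 503. One caveat: a buffered channel's capacity cannot be resized, so a truly dynamic limit would need something like an atomic counter checked against the current limit rather than the channel in the earlier sketch.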

samos123 commented 10 months ago

I guess if you have large requests and provide Lingo as a public service, this would be a real concern. Let's assume each Lingo instance can have at most 60k open connections and each request is 1 MB; then you would need 60 GB of memory to hold those requests. Someone who runs a large public Lingo instance might have other DDoS protections in place on top of Lingo (e.g. an API gateway or other software that includes such protection) and in that case wouldn't need this feature.

My vote would be to postpone this until we have a user that runs Lingo on a public endpoint. I am not against including this though. @nstogner your thoughts?

If you are implementing this, I would want a default of unlimited, or a number so large that a user with plenty of memory and no malicious actors (e.g. an internal Lingo deployment) wouldn't encounter an error.
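A sketch of that default, assuming a hypothetical --max-queue-size flag where 0 (the default) means unlimited, so internal deployments keep today's behavior:

```go
import (
	"flag"
	"math"
)

// Hypothetical flag; 0, the default, disables the limit entirely.
var maxQueueSize = flag.Int("max-queue-size", 0, "max queued requests; 0 means unlimited")

func queueLimit() int {
	if *maxQueueSize <= 0 {
		return math.MaxInt // effectively unbounded: the queue never returns 503
	}
	return *maxQueueSize
}
```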