microsoft / azure-container-apps

Roadmap and issues for Azure Container Apps
MIT License

How is connection draining supported in Azure Container Apps? #493

Open arisewanggithub opened 1 year ago

arisewanggithub commented 1 year ago

I didn't see related info about this in the public doc. How is connection draining supported?

  1. In single revision mode, I only see a health probe for the new revision. Once the new revision can take traffic, all traffic is moved to it and the old revision is deactivated immediately. Will requests that the old revision is still processing just be abandoned? When using AKS directly, we can set connection draining in the app gateway. Is there similar support in Azure Container Apps?
  2. In multiple revision mode, I think we can spin up a new revision, route 100% of traffic to it once it's ready, wait several minutes, and then deactivate the old revision. But the same question remains: after traffic is allocated to the new revision, can the requests the old revision is still processing still return to the client successfully?

ahmelsayed commented 1 year ago

In single revision mode, once the new revision passes its probes, traffic is moved to it and the old revision is allowed 30 seconds before it's asked to shut down. Once it receives the shutdown signal (SIGTERM), it can trap that signal and keep running for an additional 30 seconds. If the app doesn't trap SIGTERM, it'll have just the first 30 seconds to finish active requests.

> After traffic is allocated to the new revision, can the requests the old revision is still processing still return to the client successfully?

Active requests are not terminated from the load balancer side. So yes, existing active requests will remain uninterrupted after the traffic switch.
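As a rough illustration of the SIGTERM handling described in this comment, here is a minimal Python sketch of an app that stops accepting new work once SIGTERM arrives and then waits for in-flight requests to finish. The `GracefulWorker` class, its method names, and the `GRACE_SECONDS` constant are hypothetical names for this example only; the 30-second value is simply taken from the window mentioned above and is not read from any platform setting.

```python
import signal
import threading
import time

# Assumption: mirrors the 30-second window mentioned in the thread.
GRACE_SECONDS = 30


class GracefulWorker:
    """Tracks in-flight requests and refuses new ones after SIGTERM."""

    def __init__(self):
        self._stopping = False          # flipped by the signal handler
        self._lock = threading.Lock()   # guards the in-flight counter
        self._active = 0

    def install(self):
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Keep the handler trivial: only record that shutdown was requested.
        self._stopping = True

    @property
    def stopping(self):
        return self._stopping

    def handle(self, work):
        """Run one unit of work unless shutdown has started."""
        if self._stopping:
            return False
        with self._lock:
            self._active += 1
        try:
            work()
        finally:
            with self._lock:
                self._active -= 1
        return True

    def drain(self, timeout=GRACE_SECONDS, poll=0.05):
        """Wait until no requests are in flight or the timeout expires."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            with self._lock:
                if self._active == 0:
                    return True
            time.sleep(poll)
        with self._lock:
            return self._active == 0
```

A real service would call `install()` at startup, route each request through `handle()`, and call `drain()` before exiting once `stopping` becomes true.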

arisewanggithub commented 1 year ago

I see. Thanks! So, if my understanding is correct, in single revision mode, if our app can handle SIGTERM, it has at most 1 minute to handle requests before it's shut down. Is there any config that can be used to extend the time? Currently we are using AKS with a 120s grace period.

Another problem is that we have another container acting as an agent to upload logs. In AKS, it's a standalone container running on the same node, so deploying the main service doesn't interrupt that service's log/metric collection. But with Azure Container Apps, we can only deploy it in the same replica as the main container, since the two containers communicate through UDP and a domain socket (are domain sockets and UDP supported as a way to communicate between two containers in the same replica?). That log upload agent is not owned by us; it aggregates internally and may take several minutes to flush all the logs/metrics. For this to work correctly, 1 minute doesn't seem enough. Or is multiple revision mode the only option for us? We don't need to run multiple revisions side by side. If possible, single revision mode is our preferred way as it's simple to use.

Another question: when you say "In single revision mode, once the new revision passes its probes, traffic is moved to it", suppose the old revision is currently scaled up to 10 replicas. Even if only 1 replica is ready in the new revision, I guess the health probe will pass, right? In this case, will all the traffic be moved to this single replica without waiting for the other 9 replicas to be ready? Or will it wait for all 10 replicas to be ready and then move traffic?

ahmelsayed commented 1 year ago

> So, if my understanding is correct, in single revision mode, if our app can handle SIGTERM, it can have at most 1 min to handle requests before it's shut down.

Correct.

> Is there any config that can be used to extend the time? Currently we are using AKS and using 120s as the grace period.

Not at the moment, no.

> (are domain sockets and UDP supported as a way to communicate between two containers in the same replica?)

UDP/TCP should be fine. For Unix sockets, you'll need to mount a shared temporary directory between the two containers (see https://learn.microsoft.com/en-us/azure/container-apps/storage-mounts?pivots=aca-arm#temporary-storage).
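To make the Unix socket suggestion concrete, here is a minimal Python sketch of two processes exchanging datagrams through a socket file in a shared directory. It assumes both containers mount the same temporary volume at a common path; the function names `open_receiver` and `send_record` and the `agent.sock` file name are hypothetical, chosen only for this example.

```python
import os
import socket


def open_receiver(sock_path):
    """Bind a Unix datagram socket at sock_path (the agent side)."""
    # Remove a stale socket file left behind by a previous run.
    if os.path.exists(sock_path):
        os.unlink(sock_path)
    receiver = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    receiver.bind(sock_path)
    return receiver


def send_record(sock_path, payload):
    """Send one datagram to the socket at sock_path (the app side)."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sender:
        sender.sendto(payload, sock_path)
```

In the two-container setup, the agent would call `open_receiver` with a path inside the shared mount (for example a hypothetical `/shared/agent.sock`), and the main container would call `send_record` with the same path.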

> In order for this to work correctly, 1 min doesn't seem enough. Or multiple revision mode is the only way we can use? We don't have the requirement to run multiple revisions side by side. If possible, single revision mode is our preferred way as it's simple to use.

Yes, it seems that you'd need either a way to customize the termination grace period to allow for more than 1 minute, or to use multiple revision mode to have control over when to shut down the revision. Only the latter is possible today, but we can track exposing the termination grace period for scenarios like this.

> In this case, will all the traffic be moved to this single replica without waiting for the other 9 replicas to be ready? Or will it wait for all 10 replicas to be ready and then move traffic?

Correct, each revision has its own scale meters/counters. The HTTP counters are not inherited or shared between revisions, so the new revision goes through its own scaling flow. I agree that it'd work better if new revisions came up with the scale counters of the previous revision.

arisewanggithub commented 1 year ago

Got it. Thanks a lot for the answers! Looks like multiple revision mode is our way to go.

Since the new revision doesn't automatically scale up to match the old revision's replica count before traffic is moved, creating a new revision might cause an issue: the new revision might not be able to handle all the traffic the old revision was serving. In AKS, this is not an issue, as new pods can take traffic together with old pods during deployment and traffic is moved to new pods gradually. But with Azure Container Apps, this will be an issue.

A workaround I can think of is to raise the min replica count on the new revision and change it back later, or to gradually move more traffic to the new revision. But this makes it harder to use. So, one suggestion I can make is to add a parameter for the desired replica count when creating a new revision. In single revision mode, traffic would not be moved before the desired replica count is reached. In multiple revision mode, users could manually move the traffic after the replica count is reached, without changing the min replica count.
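The "gradually move more traffic" workaround can be sketched as a small calculation: give the new revision a traffic share proportional to how many of its desired replicas are ready. This is only an illustration of the idea in this comment, not an Azure Container Apps feature or API; `new_revision_weight` and its parameters are hypothetical, and applying the resulting weight would be up to your own deployment tooling.

```python
def new_revision_weight(ready_replicas, desired_replicas):
    """Return the percentage of traffic (0-100) to send to the new revision.

    The share grows in proportion to the number of ready replicas and
    reaches 100 only once the desired replica count is met.
    """
    if desired_replicas <= 0:
        raise ValueError("desired_replicas must be positive")
    # Clamp so that extra or negative counts cannot push the weight out of range.
    ready = max(0, min(ready_replicas, desired_replicas))
    return (100 * ready) // desired_replicas
```

With 10 desired replicas and only 1 ready, this yields a 10% share for the new revision, leaving the remaining 90% on the old revision until more replicas come up.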

clarenceb commented 1 year ago

What about a preStop lifecycle hook? I understand we can handle SIGTERM in our app code or in a shell script as described above. Are there plans to support container hooks? Lifecycle hooks are more explicit and discoverable (i.e. you don't need to go looking at the container's Dockerfile/scripts), and they support exec and HTTP handlers.

tasdflkjweio commented 9 months ago

It's been nearly a year on this. Is there any update? This functionality seems pretty useful.