knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Activator will OOM under backpressure #13583

Open Peilun-Li opened 1 year ago

Peilun-Li commented 1 year ago

What version of Knative?

0.22.3

Expected Behavior

Activator should be resilient under backpressure

Actual Behavior

Activator OOM'ed under backpressure

Steps to Reproduce the Problem

  1. Create an example target ksvc to mimic a service with limited capacity (due to resource limits, a runtime/dependency bottleneck, or similar). For the example here, I created a service with maxReplicas=1 and containerConcurrency=2 that sleeps for 30s on each request before returning a dummy response (a rough sketch of such a Service is included after this list).
  2. Attack the target ksvc with a throughput higher than it can handle. For example, attack with 200 rps and a 1s timeout:
    ali --body 'dummy request body' -m POST --duration 7200s --rate 200 --timeout 1s https://ksvc-endpoint-url
  3. Monitor activator resource usage: memory climbs over time until the pod OOMs. Increasing the memory limit only delays the OOM, it does not prevent it.
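
For reference, here is a rough sketch of the Service used in step 1. The name and image are placeholders, and annotation names follow current Knative docs, so they may differ slightly on older releases:

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: sleepy-ksvc                            # placeholder name
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/max-scale: "1" # cap the revision at a single replica
        spec:
          containerConcurrency: 2                  # at most 2 in-flight requests per pod
          containers:
            - image: example.com/sleep-app:latest  # placeholder: sleeps 30s, then returns a dummy response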

The default 600MB activator deployment OOM'ed after ~15 min.


Increased activator deployment memory to 2000MB. OOM'ed after ~50 min.


CPU utilization stays low throughout, so the activator does not scale up (and I'm not sure whether scaling up the activator would help here).

Is there any way or setting we can leverage to prevent the activator from OOMing under such backpressure, which could impact every ksvc in the cluster? Ideally the activator would deny surplus requests to the target ksvc once its queue is already long/full. Or is there a tool we could borrow from Istio to enforce, e.g., rate limiting at the per-ksvc level? It looks like we do have a breakerQueueDepth, but somehow it isn't effective at preventing the OOM. I suspect (though I might be wrong) it could be related to https://github.com/golang/go/issues/35407.

Thanks for any help!

psschwei commented 1 year ago

I think what's causing the problem is that all data is flowing through the activator initially (you can verify this by checking whether the SKS is in proxy mode).

Usually what would happen with this amount of traffic is the service would scale up, which would remove the activator from the data path (i.e. SKS in serve mode) and send traffic directly to the pod. But because you've set max replicas to one, the service can't scale up and alleviate the pressure.

There are a couple of things you can try in this situation:

One other thing: you've listed knative v0.22.3 as the version you're using. That's been EOLed for over a year now, so I'd recommend upgrading.

Peilun-Li commented 1 year ago

Thanks @psschwei. Yes, we do want to keep the activator in the path (even when the target already has some capacity), as that turns out to help with long-tail performance. The max_replicas=1 setting just mimics the production behavior where a service hits a bottleneck (e.g., it runs out of node resources, it has already scaled to a considerably high maximum, or it simply crashes entirely).

We do have min_replicas=3 for the activator deployment, though unfortunately that doesn't solve the OOM. In this case the 2 activator pods that are in the request path both hit the OOM. In that sense I'd worry that tuning activator capacity won't help either, as it would only spread the OOM to more activator pods.

I think this is a general backpressure problem that can happen even in low-traffic scenarios (as in the example here; eventually it can happen in any case where input throughput > output throughput holds for long enough). I agree some of the recommendations here would help alleviate it, but they can't fully solve it. Ideally, to limit the blast radius under backpressure, the target service may fail but the activator should not.

(And yes, we are looking into upgrading the knative version we are using.)

Peilun-Li commented 1 year ago

In case it helps, here's a plot of activator_request_concurrency reported by one activator pod during the example attack. It appears to keep growing without any cap, past 10k or so.

Peilun-Li commented 1 year ago

While we're not sure about the root cause or potential fixes, we'll try upgrading our knative version to see if it helps. Also sharing some mitigations we plan to patch into our system, in case they help others (the idea is to turn a hard OOM kill into a soft high-memory failover):

  1. Increase the activator deployment's memory.
  2. Add an additional memory-utilization-based HPA policy to scale up the activator under high memory utilization, in the hope that excess requests get assigned to another activator pod after the scale-up.
  3. Have a cronjob that occasionally does a rollout restart of the activator deployment.

There's a potentially better mitigation, but it would need some work within the codebase:

  1. Customize the activator deployment's liveness probe to fail when container memory utilization exceeds a high threshold (A), so that K8s restarts the container gracefully.
  2. Add an additional memory-utilization-based HPA policy to scale up the activator at a lower memory utilization threshold (B). (A) needs to be larger than (B) so that activator pods don't all fail the liveness probe at the same time. (A rough sketch of such an HPA is shown after this list.)
  3. Ensure the activator deployment's memory request equals its limit for better availability (less chance of being OOMKilled due to node overcommit).
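
For illustration, a rough sketch of the memory-based HPA in (B); the thresholds and replica bounds are purely illustrative, and in practice this would likely mean extending the activator HPA that knative-serving already ships rather than adding a second one:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: activator
      namespace: knative-serving
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: activator
      minReplicas: 3
      maxReplicas: 20                  # illustrative upper bound
      metrics:
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: 75   # (B): scale out well before the liveness threshold (A)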

mbaynton commented 1 year ago

It seems reasonable to me for the activator to start returning 503 Service Unavailable to additional requests if a large number of requests have accrued for a single revision and the activator's memory usage is approaching its own pod's request.

mbrancato commented 1 year ago

I recently hit this issue. For me, the best fix was setting the target burst capacity to 0 as recommended here. As I understand it, this takes the activator out of the request path very early once a revision has pods ready.
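
For anyone else hitting this, here's a rough sketch of that setting (the Service name is a placeholder); per the Knative autoscaling docs it can also be set cluster-wide via the target-burst-capacity key in the config-autoscaler ConfigMap:

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: my-ksvc                    # placeholder name
    spec:
      template:
        metadata:
          annotations:
            # 0 removes the activator from the path as soon as the revision has enough ready capacity
            autoscaling.knative.dev/target-burst-capacity: "0"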

What seemed to be the cause was a combination of a large ingress rate and a relatively low maxScale, such that the achievable capacity (pods * target rps) did not come near the actual rate of requests.

[2266536.931116] activator invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=969

In an attempt to solve this with resources, I doubled our max scale from 20 to 40 activator replicas and increased the memory from 1GiB to 4GiB per pod. It still died frequently with an OOM, presumably as it queued up requests.

Peilun-Li commented 5 months ago

Update: we upgraded our knative to 1.7.4 a while ago and are still facing these issues. There's an even more extreme failure pattern: if the target upstream service fails to handle requests (for whatever reason), the activator OOMs all the same under the resulting traffic backpressure. We still want to keep the activator in the request path for the load-balancing benefits, so we'd appreciate it if this loose end could be tightened up within the activator.

akdigitalself commented 1 month ago

Is there an update on this? As mentioned in the previous comment, if a service managed by knative is down, backpressure builds up very quickly in the activator, causing that activator pod to OOM and also impacting requests queued up in it for other services. We would like to keep the activator in the request path, so it would be great if the activator could be updated to handle this.