kumahq / kuma

🐻 The multi-zone service mesh for containers, Kubernetes and VMs. Built with Envoy. CNCF Sandbox Project.
https://kuma.io/install
Apache License 2.0

Graceful shutdown prevents outgoing requests #4720

Closed: michaelbeaumont closed this issue 5 months ago

michaelbeaumont commented 2 years ago

What happened?

We are in a regime where a horizontal pod autoscaler (HPA via KEDA) scales the number of pods up or down. In the down-scaling case, which is relevant here, some pod is picked and terminated, irrespective of that pod's processing state. Computations might be fully in flight when the termination signal is received. The pod will eventually terminate; all we can do is try to complete the current work. But in order to do that successfully, we need to communicate updates (on channels not necessarily previously established). From my point of view, kuma-dp (and Envoy) should just continue processing and terminate last. (But I do understand that "terminate last" is a challenge. See my suggestions above for that.)

Originally posted by @zd9KgA in https://github.com/kumahq/kuma/issues/4229#issuecomment-1201395505

See conversation in that PR

jakubdyszkiewicz commented 2 years ago

Triage: @zd9KgA do you still see this in the newest version of Kuma? https://kuma.io/docs/1.7.x/explore/dpp-on-kubernetes/#lifecycle

zd9KgA commented 2 years ago

Hi @jakubdyszkiewicz: The description looks promising. Is there an image on Docker Hub that I could use to test?

michaelbeaumont commented 2 years ago

@zd9KgA it's already present in 1.7.x

zd9KgA commented 2 years ago

Hi @michaelbeaumont: Understood. As I have no infrastructure in place to build the images myself, I am wondering whether images like [1.8.1-preview.8b4194ea6](https://hub.docker.com/layers/kuma-dp/kumahq/kuma-dp/1.8.1-preview.8b4194ea6/images/sha256-53d88df0d636a454d2717a34858df1e3f5bf6eaa9e5dd47335fcb9ddc4ab529f?context=explore) have this feature included.

michaelbeaumont commented 2 years ago

Yeah they do.

zd9KgA commented 2 years ago

I'll have a try.

zd9KgA commented 2 years ago

What I can report as of now is that the computational flow in my experiments was disrupted multiple times. I'm currently rerunning the experiment with a KEDA ScaledJob (--> Job) as opposed to a KEDA ScaledObject (--> Deployment). If the second experiment completes without disruption, the suspicion would be that the scale-down of deployment(s) and the corresponding termination of pods is the cause. My problem is that we do not have enough visibility into what happens / fails during tear-down. I'll come back when the second experiment has completed.

zd9KgA commented 2 years ago

Hi! The switch-over to ScaledObject(s) did not work. But since apparently even the connection to Fluentd is lost, I lost visibility. I am currently re-establishing it by setting traffic.kuma.io/exclude-outbound-ports appropriately. With that set, I hope to get more insight into what exactly goes wrong. I'll keep you posted.

I've moved to the present state of the git branch release-1.8 and picked up the default Kuma images for that release.

michaelbeaumont commented 2 years ago

@zd9KgA Did you get a chance to take a closer look?

zd9KgA commented 2 years ago

Hi @michaelbeaumont, in part: I haven't gained detailed insight into the timing / sequence of events, but I can assert that the behavior is not as expected. Once the pod is marked terminating as a result of scaling down, some of the intended communication is blocked. This is true, for example, for Fluentd, Redis, and PostgreSQL. If the corresponding ports are excluded (via traffic.kuma.io/exclude-outbound-ports), then such problems do not arise. Unfortunately, this is not a true test set-up, but "merely" observations from our test systems for other purposes. I wasn't able to pinpoint from what point in time onwards connections are no longer possible; the earliest cases are within a single-digit number of seconds after the termination request. (I had hoped to find some time to analyze this in more detail, but haven't managed so far.) Is there any particular observation that I could try to make in order to help understand what is causing the unexpected behavior?
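
For context, a minimal sketch of how such an exclusion can be expressed on the workload; the name, image, and port numbers (Fluentd forward 24224, Redis 6379, PostgreSQL 5432) are illustrative assumptions rather than values taken from this thread:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: batch-worker                   # hypothetical workload name
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: batch-worker
      template:
        metadata:
          labels:
            app: batch-worker
          annotations:
            # Traffic to these outbound ports bypasses the sidecar, so the app
            # can still reach Fluentd, Redis and PostgreSQL while kuma-dp drains.
            traffic.kuma.io/exclude-outbound-ports: "24224,6379,5432"
        spec:
          containers:
            - name: worker
              image: example.org/batch-worker:latest   # placeholder image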

zd9KgA commented 2 years ago

P.S. All cases I observe relate to ExternalService - in case this is of any relevance.

github-actions[bot] commented 1 year ago

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed. If you think this issue is still relevant, please comment on it or attend the next triage meeting.

zd9KgA commented 1 year ago

Hi @michaelbeaumont: I repeated my tests based on version 2.0.0. Although I did get the "Kuma observability" in place, I did not manage to deepen my insights significantly. The problems I highlighted earlier do persist, though: the same workload runs smoothly when wrapped into KEDA scaled jobs, but occasionally fails as a KEDA scaled object. The traces of the problems are in line with what is described in https://kuma.io/docs/2.0.x/explore/dpp-on-kubernetes/#lifecycle.

Upon scaling down the replicas of the workload (when using scaled objects), some of the replicas are requested to initiate termination. In reaction to this, kuma-dp initiates draining. The workload depends on outgoing connectivity being available without any restriction until after it has completed its work. My understanding from what I see happening and from what I read in the documentation is that this assumption is not in line with how kuma-dp/envoyproxy act. My understanding is furthermore that kuma-dp would hardly be able to generically observe "when the workload is done". Correct? If so, we'd stick to scaled jobs for the workload type in question, which do not face SIGTERM when scaling down, but rely on the workload to run to completion and then terminate itself. (We do have mitigating measures in place if a workload passes away for whatever reason (retries, something like exception handling), but this causes way too much latency to fall back on for regular processing.) Cheers!
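
For readers comparing the two KEDA modes discussed above, a hedged sketch of the ScaledJob variant follows; names, image, and the trigger are illustrative assumptions, the point being that each Job runs to completion and is never sent SIGTERM by a scale-down:

    apiVersion: keda.sh/v1alpha1
    kind: ScaledJob
    metadata:
      name: batch-worker                   # hypothetical name
    spec:
      jobTargetRef:
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: worker
                image: example.org/batch-worker:latest   # placeholder image
      pollingInterval: 30                  # how often KEDA evaluates the trigger
      maxReplicaCount: 10                  # upper bound on parallel Jobs
      triggers:
        - type: rabbitmq                   # illustrative scaler; substitute your own
          metadata:
            host: amqp://guest:guest@rabbitmq.default.svc:5672/
            queueName: work
            mode: QueueLength
            value: "5"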

github-actions[bot] commented 1 year ago

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed. If you think this issue is still relevant, please comment on it or attend the next triage meeting.

github-actions[bot] commented 1 year ago

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed. If you think this issue is still relevant, please comment on it or attend the next triage meeting.

github-actions[bot] commented 10 months ago

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed. If you think this issue is still relevant, please comment on it or attend the next triage meeting.

github-actions[bot] commented 7 months ago

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed. If you think this issue is still relevant, please comment on it or attend the next triage meeting.

slonka commented 7 months ago

@michaelbeaumont please take a look at the comment and respond when you have time https://github.com/kumahq/kuma/issues/4720#issuecomment-1380542759

michaelbeaumont commented 6 months ago

@zd9KgA Are you able to test with Kubernetes v1.29? We merged support for the new sidecar containers feature recently. It's not in a released version yet, but it's supported with the latest preview, which you can install with, for example:

curl --silent -L https://kuma.io/installer.sh  | VERSION=preview sh

It can be enabled with the env var KUMA_EXPERIMENTAL_SIDECAR_CONTAINERS=true or the config experimental.sidecarContainers: "true". It'd be great to get your feedback!
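
For reference, the configuration key quoted above would look roughly like this as a kuma-cp configuration fragment (a sketch; the alternative is setting KUMA_EXPERIMENTAL_SIDECAR_CONTAINERS=true on the control-plane container):

    # kuma-cp configuration fragment (sketch)
    experimental:
      sidecarContainers: "true"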

zd9KgA commented 6 months ago

I have a v1.29.2 cluster in place and will give the preview a try. Expect feedback by the end of the week.

zd9KgA commented 6 months ago

@michaelbeaumont Looks good to me. I tested images in two versions; 0.0.0-preview.vbf6ad97da was the second one. Both worked fine. I used the Helm charts from origin/master with some local changes.

I found three changes to the Helm chart helpful:

  1. deployments/charts/kuma/templates/_helpers.tpl:
     @@ -269,6 +269,10 @@ env:
      - name: KUMA_EXPERIMENTAL_GATEWAY_API
        value: "true"
      {{- end }}
     +{{- if .Values.experimental.sidecarContainers }}
     +- name: KUMA_EXPERIMENTAL_SIDECAR_CONTAINERS
     +  value: "true"
     +{{- end }}
      {{- if .Values.cni.enabled }}
      - name: KUMA_RUNTIME_KUBERNETES_NODE_TAINT_CONTROLLER_ENABLED
        value: "true"
  2. deployments/charts/kuma/templates/cp-rbac.yaml:
     @@ -126,6 +126,7 @@ rules:
        - zoneegresses
        - zoneegressinsights
        - meshinsights
     +  - meshservices
        - serviceinsights
        - proxytemplates
        - ratelimits
  3. I used a ContainerPatch to adjust the sidecar probe settings for faster startup, but I just found out that the values in kuma/docs/generated/raw/kuma-cp.yaml almost match my final settings. These aren't the values that were set for the sidecar in my initial experiments (otherwise I wouldn't have changed them), but I didn't bother to trace where the difference came from. (See the sketch after this list.)
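
A hedged sketch of such a ContainerPatch, assuming the control plane runs in the default kuma-system namespace; the resource name, the patched probe field, and its value are illustrative assumptions, not the settings actually used in the experiment above:

    apiVersion: kuma.io/v1alpha1
    kind: ContainerPatch
    metadata:
      name: sidecar-probe-patch            # hypothetical name
      namespace: kuma-system               # must be the control plane's namespace
    spec:
      sidecarPatches:
        - op: replace
          path: /readinessProbe/initialDelaySeconds
          value: '1'                       # values are JSON-encoded strings; this one is illustrative

Workloads then opt in with the kuma.io/container-patches annotation on their pod template.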

michaelbeaumont commented 6 months ago

Thanks! That's all really helpful. Indeed, the switch was missing from the chart. 2) is unrelated but a good catch (EDIT: ok, it's because we didn't have a preview Helm chart). I'll check that the probe is working as expected.

github-actions[bot] commented 6 months ago

Removing closed state labels due to the issue being reopened.

michaelbeaumont commented 5 months ago

Closing this. Note we have also introduced graceful draining of requests, but only for inbounds.