Rolling restart results in 502 for ~minute from LoadBalancer

Tyrion85 commented 1 year ago

GKE version v1.24.12-gke.1000, us-west1 regional cluster

Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: app
      app.kubernetes.io/name: frontend
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
      labels:
        app.kubernetes.io/instance: app
        app.kubernetes.io/name: frontend
    spec:
      containers:
      - env:
        - name: ...
           value: ...
        image: ${some-image}
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - sh
              - -c
              - sleep 20
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: frontend
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 500m
            memory: 300Mi
          requests:
            cpu: 500m
            memory: 300Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          privileged: false
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp
          name: tmp
      dnsPolicy: ClusterFirst
      nodeSelector:
        app-pool: "true"
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 2000
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 120
      tolerations:
      - effect: NoSchedule
        key: app-pool
        operator: Equal
        value: "true"
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/instance: app
            app.kubernetes.io/name: frontend
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
      volumes:
      - emptyDir: {}
        name: tmp

Service:

apiVersion: v1
kind: Service
metadata:
  name: app
spec:
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/instance: app
    app.kubernetes.io/name: frontend
  sessionAffinity: None
  type: ClusterIP

Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: gce
    networking.gke.io/v1beta1.FrontendConfig: redirect-to-https
  name: app
spec:
  rules:
  - host: ...
    http:
      paths:
      - backend:
          service:
            name: app
            port:
              number: 80
        path: /
        pathType: Prefix
  tls:
  - hosts:
    - ....
    secretName: wildcard

Whenever there is a rolling restart, there is a decent chance 502 is returned to clients. A couple of months ago, it was sporadic - these days its every time. During that time, pods are healthy from the standpoint of Kubernetes (restart of the second port wouldn't start if the first one was not marked as ready). During the restart, healtchecks in GCP UI sometimes show unhealthy NEGs, even though the application is responding to checks (same checks as those for kubelet).

502 is returned even when only one pod is replaced - the second old pod is still "alive", and yet LB returns 502. As if the NEG is already destroyed.

After some unspecified time (15s-60s), NEGs seem to be created/attached correctly and proper response 200 Is returned to clients.

Service events don't have anything useful:

 Normal  Attach  33m                neg-controller  Attach 2 network endpoint(s) (NEG "xxx-8-3068cb8a" in zone "us-west1-c")
  Normal  Attach  24m                neg-controller  Attach 1 network endpoint(s) (NEG "xxx-8-3068cb8a" in zone "us-west1-b")
  Normal  Attach  24m                neg-controller  Attach 1 network endpoint(s) (NEG "xxx-8-3068cb8a" in zone "us-west1-a")
  Normal  Detach  24m (x2 over 24m)  neg-controller  Detach 1 network endpoint(s) (NEG "xxx-8-3068cb8a" in zone "us-west1-c")

Events in pods:

 Normal  LoadBalancerNegNotReady  27m   neg-readiness-reflector  Waiting for pod to become healthy in at least one of the NEG(s): [xxx-8-3068cb8a xxx-8-69e27b09[]
  Normal  Scheduled                27m   default-scheduler        Successfully assigned xxx to yyy
  Normal  Pulled                   27m   kubelet                  Container image "cr.l5d.io/linkerd/proxy-init:v2.0.0" already present on machine
  Normal  Created                  27m   kubelet                  Created container linkerd-init
  Normal  Started                  27m   kubelet                  Started container linkerd-init
  Normal  Pulled                   27m   kubelet                  Container image "cr.l5d.io/linkerd/proxy:stable-2.12.2" already present on machine
  Normal  Created                  27m   kubelet                  Created container linkerd-proxy
  Normal  Started                  27m   kubelet                  Started container linkerd-proxy
  Normal  Pulling                  27m   kubelet                  Pulling image "xxx"
  Normal  LoadBalancerNegReady     26m   neg-readiness-reflector  Pod does not belong to any NEG. Marking condition "cloud.google.com/load-balancer-neg-ready" to True.
  Normal  Pulled                   26m   kubelet                  Successfully pulled image "xxx" in 1m3.44813649s
  Normal  Created                  26m   kubelet                  Created container ...
  Normal  Started                  26m   kubelet                  Started container ...

How to even approach debugging this issue? Is it a race condition somewhere, and the issue with timeouts in healthchecks? From the standpoint of Kubernetes, everything is fine.

Looks same-ish as https://github.com/kubernetes/ingress-gce/issues/1718

I opened another issue for visibility, as this is literally a show-stopper in using this product. Intermittent downtime on every deployment is simply not acceptable in my use-case.

aojea commented 1 year ago

Kubernetes implemented a new feature that allow to have zero downtime during rolling updates, it is called TerminatingEndpoints and was implemented in 1.26 https://kubernetes.io/blog/2022/12/30/advancements-in-kubernetes-traffic-engineering/

You can find this presentation that explains the internal details

https://docs.google.com/presentation/d/1DRZ6PFJHPq7Z3av0HyHVCjeXJ_v5xYYleq2Q4IZStDk/edit?usp=sharing

basically your pods has to handle the SIGTERM signal to graceful terminate and enable the loadbalancers to remove the Pod out of the backends rotations before it dies, so new traffic is not blackholed

Tyrion85 commented 1 year ago

@aojea What about the second scenario, where old pod is gracefully terminated and removed, but new pod's NEG is created but not properly attached to LB backend service? See second failure mode described here https://github.com/kubernetes/ingress-gce/issues/1718#issuecomment-1140097570

How does what you posted helps with this case? I distributed the application across the zones, and I still see failures in this way.

swetharepakula commented 1 year ago

@Tyrion85 How often are you seeing this? Typically only the first pod in a new zone will have a long start up time.

One option is to add a preStop hook to the pods so that the old pod stays on longer and gives the new pod more time to come up.

Tyrion85 commented 1 year ago

@swetharepakula we've started seeing this on every rolling restart now. Previously (no hard data, but ~until two weeks ago), these were only sporadic, hardly even noticable. There was no observable change on our side that would cause this, so my feeling is something is wrong internally on Google's side (as Google manages both NEG and Ingress Controllers), for these to become so frequent lately in our region/cluster.

preStop hooks are actually in place - it was 20s initially (looking at internal git history, it has been 20s for year+), bumped to 60s recently, but no dice. I heavily edited above manifest to remove sensitive info, must have deleted preStop hook from example manifest by mistake, my bad. The preStop hook config is classic sleep, for 60s. App is just a basic express server, absolutely nothing fancy there. Has been working fine for years with GCE Ingress.

From what I noticed today (after scaling to 3 replicas spread across zones, and pinging my endpoint while the restart is in progress), is that 502s are returned only when last old pod is in Terminating state, with three new pods in Running state. This tells me that backend is without serving pods, even though three pods are created and running (in the span of ~3 minutes). This seems super slow for Ingress Controller to not attach NEGs in 3 minutes, wouldn't you agree? Yet this is what happens, and has been happening for the last two weeks.

Cluster is very small, 30+ nodes with autoscaler, GKE stable release, but we do use Ingress GCE a lot comparatively (20+ Ingresses, and thus 20+ LoadBalancers in this GCP project). Maybe some issue with scaling for Ingress Controller?

As it stands right now, our best option seems to be to use some third-party Ingress Controller (such as nginx or Emissary-ingress), which itself will be with NEG (we have nginx ingress controller with GCP HTTPS LoadBalancer deployed actually, but don't really use it). As those are restarted fairly infrequently, at the price of one additional internal network hop we potentially gain a lot in stability.

swetharepakula commented 1 year ago

@Tyrion85 are you able to see that the endpoints are not in the NEG when the 502s are returned? You can look at the NetworkEndpointGroup cloud audit logs to see all the NEG related operations.

I would also recommend opening up a support ticket so we can investigate your case closer.

keith-miller commented 1 year ago

I'm in us-west2 and seeing the same behavior. I'm on the basic plan so I'll just be watching this ticket :)

Tyrion85 commented 1 year ago

@swetharepakula Opening the support ticket seems like the only option, at this point. I don't think this even has much to do with Ingress controller per se. These two images are taken in the span of a couple of seconds, with Nginx Ingress Controller and HTTPS LoadBalancer with NEG - as you can see, all pods are up, but backend has no healthy endpoints.

This is an extreme degradation of service reliability IMHO. Right now, its been 10 minutes, and NEG controller still hasn't picked up endpoints correctly.

EDIT: After 10 minutes with no action on my part:

IvanUkhov commented 11 months ago

This can potentially be of relevance:

https://cloud.google.com/kubernetes-engine/docs/troubleshooting/troubleshoot-load-balancing#500-series-errors

It explains what might be happening and how to tackle it.

swetharepakula commented 9 months ago

New home for the documentation: https://cloud.google.com/kubernetes-engine/docs/troubleshooting/load-balancing#500-series-errors

@Tyrion85 , to clarify NEG controller does not controller the health status of the endpoints. This is determined by the health check that is set up by the Ingress controller when using Ingress. Are you saying that the endpoints specified in the NEG controller are incorrect or that they aren't healthy immediately? Again a support case is likely the best approach as this is requires a case by case analysis, especially if the prestop hook is not working.

giannello commented 6 months ago

@Tyrion85 is that deployment referenced by more than one Service?

anderspetersson commented 4 months ago

@swetharepakula

https://cloud.google.com/kubernetes-engine/docs/troubleshooting/load-balancing#500-series-errors mentions a step adding a preStop hook to all the containers:

spec:
  containers:
  - name: my-app
    ...
    lifecycle:
      preStopHook:
        exec:
          command: ["/bin/sh", "-c", "sleep 120s"]

The field is named preStop and not preStopHook. https://www.manifests.io/kubernetes/1.30/io.k8s.api.core.v1.LifecycleHandler?linked=Deployment.spec.template.spec.containers.lifecycle.preStop

Copying the code from docs resulting in this error: strict decoding error: unknown field "spec.template.spec.containers[0].lifecycle.preStopHook", unknown field "spec.template.spec.containers[1].lifecycle.preStopHook"

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 3 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

kubernetes / ingress-gce

Rolling restart results in 502 for ~minute from LoadBalancer #2222