rg0now closed this issue 6 months ago
This ends up being two related bugs, both occurring because we immediately remove the pod IP from the stunnerd config when the pod enters the Terminating state, instead of waiting until it actually terminates. This then causes the problems below:
The takeaway is that fixing this on the stunnerd side would be more difficult than expected.
Further investigation: it turns out the problem is that we are still using the old Endpoints API for backend pod discovery, which does not consider terminating pods, in contrast to the modern EndpointSlice API, which does.
By default, Kubernetes removes all pod IPs that belong to "Terminating" backend pods from the Endpoints object. For the above example, after graceful shutdown starts we get an empty Endpoints resource:
apiVersion: v1
kind: Endpoints
metadata:
  name: media-plane
  namespace: default
  labels:
    app: media-plane
Since we use the endpoint IPs in this object for permission request handlers and for port-range filtering, we immediately break all existing connections to "Terminating" pods.
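To make the effect concrete, here is a minimal Go sketch of a peer-permission check driven by the endpoint IP set; the handler shape and the allowedPeers set are hypothetical illustrations, not stunnerd's actual API.

// Hypothetical sketch: a TURN-style peer permission check backed by the set of
// backend endpoint IPs learned from the Service.
package main

import (
	"fmt"
	"net"
)

// allowedPeers holds the backend pod IPs taken from the Service endpoints.
var allowedPeers = map[string]bool{
	"10.244.0.3": true,
}

// permitPeer admits a peer only if its IP is in the current endpoint set.
func permitPeer(peer net.IP) bool {
	return allowedPeers[peer.String()]
}

func main() {
	peer := net.ParseIP("10.244.0.3")
	fmt.Println(permitPeer(peer)) // true: traffic to the backend pod is allowed

	// With the old Endpoints API the IP disappears from the endpoint list the
	// moment the pod turns Terminating, so the next permission check fails and
	// the existing connection breaks.
	delete(allowedPeers, peer.String())
	fmt.Println(permitPeer(peer)) // false
}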
The EndpointSlice API, however, returns all the pod IPs, including the Terminating ones:
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: media-plane-qzbrt
  namespace: default
  labels:
    app: media-plane
    kubernetes.io/service-name: media-plane
addressType: IPv4
ports:
  - name: ""
    port: 9001
    protocol: UDP
endpoints:
  - addresses:
      - 10.244.0.3
    conditions:
      ready: false
      serving: true
      terminating: true
    nodeName: stunner
Observe the endpoint IP 10.244.0.3 with terminating: true; that's our terminating backend pod.
So the solution is to rewrite the gateway operator from the Endpoints API to the EndpointSlice API and then add the pod IPs for terminating pods to the list of permitted endpoints.
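A rough client-go sketch of that idea follows; the endpointIPs helper and the client bootstrap are made up for illustration and this is not the actual gateway operator code. It lists the EndpointSlices of a Service and keeps every address that is still serving, whether ready or terminating.

// Illustrative-only sketch: collect backend pod IPs from the EndpointSlices of
// a Service, keeping endpoints that are already terminating but still serving.
// This is the key difference from the legacy Endpoints API, which drops them.
package main

import (
	"context"
	"fmt"

	discoveryv1 "k8s.io/api/discovery/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// endpointIPs returns every address from the EndpointSlices belonging to the
// given Service, including "terminating" endpoints that are still serving.
func endpointIPs(ctx context.Context, c kubernetes.Interface, ns, svc string) ([]string, error) {
	slices, err := c.DiscoveryV1().EndpointSlices(ns).List(ctx, metav1.ListOptions{
		// EndpointSlices are linked to their Service via this well-known label.
		LabelSelector: discoveryv1.LabelServiceName + "=" + svc,
	})
	if err != nil {
		return nil, err
	}

	var ips []string
	for _, slice := range slices.Items {
		for _, ep := range slice.Endpoints {
			// Skip endpoints that are no longer serving at all; keep the rest.
			if ep.Conditions.Serving != nil && !*ep.Conditions.Serving {
				continue
			}
			ips = append(ips, ep.Addresses...)
		}
	}
	return ips, nil
}

func main() {
	// Out-of-cluster config for a quick manual test; an operator would use the
	// in-cluster config and an informer instead of a one-shot List.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ips, err := endpointIPs(context.Background(), client, "default", "media-plane")
	if err != nil {
		panic(err)
	}
	fmt.Println(ips) // e.g. [10.244.0.12 10.244.0.13 10.244.0.14]
}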
Addendum: with two "ready" pods and one "terminating" pod:
media-plane-55658cb4f5-hdw6c 1/1 Running 0 10.244.0.14
media-plane-55658cb4f5-pjp9c 1/1 Terminating 0 10.244.0.12
media-plane-55658cb4f5-vjvnz 1/1 Running 0 10.244.0.13
We get this EndpointSlice:
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: media-plane-qzbrt
  namespace: default
  labels:
    app: media-plane
    kubernetes.io/service-name: media-plane
ports:
  - name: ""
    port: 9001
    protocol: UDP
addressType: IPv4
endpoints:
  - addresses:
      - 10.244.0.12
    conditions:
      ready: false
      serving: true
      terminating: true
    nodeName: stunner
    targetRef:
      kind: Pod
      name: media-plane-55658cb4f5-pjp9c
      namespace: default
      uid: 277d19b4-f5ab-4616-8024-4deebca8f7e9
  - addresses:
      - 10.244.0.13
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: stunner
    targetRef:
      kind: Pod
      name: media-plane-55658cb4f5-vjvnz
      namespace: default
      uid: 965ae9b0-13c8-4145-a850-533b857876af
  - addresses:
      - 10.244.0.14
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: stunner
    targetRef:
      kind: Pod
      name: media-plane-55658cb4f5-hdw6c
      namespace: default
      uid: 02a3e3e2-b83c-442c-afe4-4a7ad647bf19
I've tested this and the issue can no longer be reproduced with the new STUNner dev version that uses the EndpointSlice controller.
Fire up the UDP greeter example again but set terminationGracePeriodSeconds: 300 in the media-plane deployment.
Create a turncat tunnel and a TURN allocation:
export IPERF_ADDR=$(kubectl get pod -l app=media-plane -o jsonpath="{.items[0].status.podIP}")
turncat --log=all:TRACE - 'k8s://stunner/udp-gateway:udp-listener' udp://$IPERF_ADDR:9001
Hi
Greetings from STUNner!
...
Scale the media-plane deployment down to 0 pods:
kubectl scale deployment media-plane --replicas=0
This will trigger the UDP greeter pod to enter the Terminating state:
kubectl get pods
NAME READY STATUS RESTARTS AGE
media-plane-55658cb4f5-d7h2l 1/1 Terminating 0 91s
And the turncat tunnel stays open:
...
Hi again after terminate
Greetings from STUNner!
And the connection remains open, isn't it?
Greetings from STUNner!
...
I am facing the same issue using the latest dev image (stunnerd:dev). I am using STUNner in AWS with Netbird to manage a network mesh. When I delete one of the pods there is some downtime: I am running a simple ping test, and some connections are lost.
Would it be possible to add a preStop lifecycle hook to the gateway pods created via the Dataplane template?
apiVersion: stunner.l7mp.io/v1
kind: Dataplane
metadata:
  annotations:
    meta.helm.sh/release-name: stunner-gateway-operator
    meta.helm.sh/release-namespace: stunner-system
  labels:
    app.kubernetes.io/managed-by: Helm
  name: default
spec:
  args:
    - -w
    - --udp-thread-num=16
  command:
    - stunnerd
  enableMetricsEndpoint: false
  hostNetwork: false
  image: docker.io/l7mp/stunnerd:dev
  imagePullPolicy: Always
  replicas: 2
  resources:
    limits:
      cpu: 2
      memory: 512Mi
    requests:
      cpu: 500m
      memory: 128Mi
  terminationGracePeriodSeconds: 3600
-----
lifecycle:
  preStop:
    exec:
      command: ["/bin/busybox", "sleep", "5"]
-----
Not sure if this is the cause, but in EKS endpoints are not renewed fast enough and this causes downtime on services. I usually fix this by adding a preStop sleep of 5 seconds.
I am aware that the stunnerd image does not have a sleep command, but this is easy to fix.
Meanwhile, I will try Kyverno to mutate pods at creation time to check whether adding a lifecycle hook has any benefit.
I'm not a hundred percent sure I get the problem right. Do you terminate a backend pod and that makes your connections break, or do you terminate a stunnerd pod and that is what causes your connections to break?
My understanding is that it's case (2). Now, the question is: why are you killing your stunnerd pods? Are you scaling down a STUNner Gateway? If you're doing this right then this should never make STUNner disconnect any live connections: stunnerd implements the graceful shutdown pattern, so as long as there are live TURN allocations it should just keep running, at least for an hour as per the terminationGracePeriodSeconds: 3600 setting. If stunnerd goes away before that and this breaks active connections, then that's a bug. If you're force-killing stunnerd for another reason, then we'd like to know why before adding support for your use case.
NB: In retrospect it was a poor decision to limit the Dataplane spec to a few semi-random params we thought were important at the time. We're spending too much time adding stuff back from the Deployment spec to support specific use cases. We should have supported the full Deployment spec as is and called it a day...
Thanks for the fast response. I terminate the stunnerd pod. I have 3 replicas of the UDP gateway running to be in HA mode. To simulate a k8s node downscale or pod rescheduling, I manually run kubectl delete pod on one of the running replicas. I do not execute a force delete, just a standard kubectl delete pod.
I was able to add a mutation policy that adds a preStop lifecycle hook:
spec:
  containers:
    - args:
        - -w
        - --udp-thread-num=16
      command:
        - stunnerd
      env:
        - name: STUNNER_ADDR
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: STUNNER_NAME
          value: udp-gateway
        - name: STUNNER_NAMESPACE
          value: stunner
        - name: STUNNER_CONFIG_ORIGIN
          value: http://10.120.149.224:13478
      image: marcportabellaclotet/stunner-sleep:latest
      imagePullPolicy: Always
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/busybox
              - sleep
              - "5"
After adding the lifecycle hook, I am able to kubectl delete the pod and the connection does not break.
Ping output without the lifecycle hook:
64 bytes from 192.168.130.100: icmp_seq=921 ttl=63 time=202 ms
64 bytes from 192.168.130.100: icmp_seq=922 ttl=63 time=110 ms
From 192.168.8.1 icmp_seq=930 Destination Host Unreachable
From 192.168.8.1 icmp_seq=931 Destination Host Unreachable
From 192.168.8.1 icmp_seq=933 Destination Host Unreachable
From 192.168.8.1 icmp_seq=934 Destination Host Unreachable
64 bytes from 192.168.130.100: icmp_seq=935 ttl=63 time=4656 ms
64 bytes from 192.168.130.100: icmp_seq=936 ttl=63 time=3651 ms
Adding the 5-second sleep avoids the downtime. Note that I am using an AWS NLB, which may take a little time to reconfigure the endpoints.
This is the config I am using:
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: udp-gateway
  namespace: stunner
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb" # For NLB
    service.beta.kubernetes.io/aws-load-balancer-subnets: "subnet-yyy,subnet-xxx,subnet-zzz" # Replace with your subnet IDs
    service.beta.kubernetes.io/aws-load-balancer-internal: "false" # Optional: For internal load balancer
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: HTTP
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/live"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "8086"
spec:
  gatewayClassName: stunner-gatewayclass
  listeners:
    - name: udp-listener
      port: 3478
      protocol: TURN-UDP
---
apiVersion: stunner.l7mp.io/v1
kind: UDPRoute
metadata:
  name: stunner-headless
  namespace: stunner
spec:
  parentRefs:
    - name: udp-gateway
  rules:
    - backendRefs:
        - name: udp-gateway
          namespace: stunner
I'm afraid you're using graceful shutdown incorrectly. The idea is that as long as there are active TURN allocations, STUNner will refuse to shut down even if the stunnerd pod is deleted (at least until the terminationGracePeriod passes). If, however, there are no active allocations, there is no harm in removing a stunnerd pod, so it will terminate immediately on being deleted. Therefore graceful shutdown cannot be tested with ping: you have to actually create a TURN allocation (say, using the iperf test) and then kill the offending stunnerd pod. You should see your connection remaining open despite the stunnerd pod entering the Terminating state. In this case, there's no reason to add a preStop hook. If this is not what's happening, please file an issue, because that's a serious bug we have to investigate and fix.
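For reference, the pattern being described is roughly the following. This is a minimal sketch assuming a hypothetical activeAllocations() helper that reports the live TURN allocation count; it is not stunnerd's actual shutdown code.

// Minimal sketch of the graceful shutdown pattern: after SIGTERM, keep serving
// until there are no live TURN allocations left. Kubernetes enforces the upper
// bound via terminationGracePeriodSeconds, after which the pod is SIGKILLed.
package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
	"time"
)

// activeAllocations is a stand-in for querying the TURN server's live
// allocation count; a real server would get this from its TURN stack.
func activeAllocations() int { return 0 }

func main() {
	// Wait for SIGTERM, which Kubernetes sends when the pod is deleted.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()
	<-ctx.Done()

	// Graceful shutdown: new and existing allocations keep working while we
	// wait; we only exit once the last allocation is gone.
	for activeAllocations() > 0 {
		log.Printf("draining: %d allocation(s) still alive", activeAllocations())
		time.Sleep(time.Second)
	}
	log.Println("no live allocations left, exiting")
}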
I believe the core issue isn't that the current connection is broken. The problem arises during traffic redirection, where there's a race condition between the removal of endpoint IPs and the traffic being sent to them. In our scenario, which affects not just STUNner but other Kubernetes services as well, the Network Load Balancer (NLB) attempts to direct traffic to pods that are in the process of gracefully shutting down. As a result, for a few seconds, new traffic is directed to pods that have already received the SIGTERM signal and are no longer accepting new requests.
Now we're talking!...:-) Interesting. Is this AWS/EKS?
Anyway, by design STUNner does accept new connections during graceful shutdown, exactly to address buggy/slow load balancers. This makes the preStop hook unnecessary. Do you have actual tests where this fails?
Thanks for looking into this. I am not able to reproduce the issue any more. Even when gracefully deleting all the running stunnerd pods, the connection remains stable.
We're so grateful you tested this! Graceful shutdown is one of the most critical features cloud-native software must provide, but getting it right, especially for gateways, is quite difficult. STUNner's graceful shutdown codepath was broken for a long time and I'm so happy we could finally restore this useful feature before v1!
We will perform some intensive testing in the next few days and I will share feedback. Thanks!
Description
Clients' TURN connections should remain alive after the backend pod they are talking to via STUNner enters graceful shutdown. However, currently connections break as soon as the backend pod status becomes Terminating.

Steps to Reproduce
Fire up the UDP greeter example and add a sleep to the preStop hook to simulate graceful shutdown:
Create a client connection:
Kill the udp-greeter pod but make sure to keep turncat open:
Sending anything on the turncat connection silently fails.
Expected behavior: The turncat connection should remain functional until the backend pod finishes graceful shutdown.
Actual behavior: The connection breaks and the stunnerd logs show the below:

Versions
Latest master. The regression must have appeared during the stunnerd media routing rewrite.