rg0now closed this issue 6 months ago
This ends up being two related bugs, both occurring because we immediately remove the pod IP from the stunnerd config when the pod enters the Terminating state, instead of waiting until it actually terminates. This then causes the problems below:
The takeaway is that fixing this on the stunnerd side would be more difficult than expected.
Further investigation: it turns out the problem is that we are still using the old Endpoints API for backend pod discovery, which does not consider terminating pods, in contrast to the modern EndpointSlice API, which does.
By default, Kubernetes removes all pod IPs that belong to "Terminating" backend pods from the Endpoints object. For the above example, after graceful shutdown starts we get an empty Endpoints resource:
apiVersion: v1
kind: Endpoints
metadata:
  name: media-plane
  namespace: default
  labels:
    app: media-plane
Since we use the endpoint IPs in this object for permission request handlers and for port-range filtering, we immediately break all existing connections to "Terminating" pods.
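To make the effect concrete, here is a minimal Go sketch of a peer-permission check driven by the endpoint IP set; the handler shape and the allowedPeers set are hypothetical illustrations, not stunnerd's actual API.

// Hypothetical sketch: a TURN-style peer permission check backed by the set of
// backend endpoint IPs learned from the Service.
package main

import (
	"fmt"
	"net"
)

// allowedPeers holds the backend pod IPs taken from the Service endpoints.
var allowedPeers = map[string]bool{
	"10.244.0.3": true,
}

// permitPeer admits a peer only if its IP is in the current endpoint set.
func permitPeer(peer net.IP) bool {
	return allowedPeers[peer.String()]
}

func main() {
	peer := net.ParseIP("10.244.0.3")
	fmt.Println(permitPeer(peer)) // true: traffic to the backend pod is allowed

	// With the old Endpoints API the IP disappears from the endpoint list the
	// moment the pod turns Terminating, so the next permission check fails and
	// the existing connection breaks.
	delete(allowedPeers, peer.String())
	fmt.Println(permitPeer(peer)) // false
}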
The EndpointSlice API, however, returns all the pod IPs, including the Terminating ones:
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: media-plane-qzbrt
  namespace: default
  labels:
    app: media-plane
    kubernetes.io/service-name: media-plane
addressType: IPv4
ports:
  - name: ""
    port: 9001
    protocol: UDP
endpoints:
  - addresses:
      - 10.244.0.3
    conditions:
      ready: false
      serving: true
      terminating: true
    nodeName: stunner
Observe the endpoint IP 10.244.0.3 with terminating: true; that's our terminating backend pod.
So the solution is to rewrite the gateway operator from the Endpoints API to the EndpointSlice API and then add the pod IPs for terminating pods to the list of permitted endpoints.
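A rough client-go sketch of that idea follows; the endpointIPs helper and the client bootstrap are made up for illustration and this is not the actual gateway operator code. It lists the EndpointSlices of a Service and keeps every address that is still serving, whether ready or terminating.

// Illustrative-only sketch: collect backend pod IPs from the EndpointSlices of
// a Service, keeping endpoints that are already terminating but still serving.
// This is the key difference from the legacy Endpoints API, which drops them.
package main

import (
	"context"
	"fmt"

	discoveryv1 "k8s.io/api/discovery/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// endpointIPs returns every address from the EndpointSlices belonging to the
// given Service, including "terminating" endpoints that are still serving.
func endpointIPs(ctx context.Context, c kubernetes.Interface, ns, svc string) ([]string, error) {
	slices, err := c.DiscoveryV1().EndpointSlices(ns).List(ctx, metav1.ListOptions{
		// EndpointSlices are linked to their Service via this well-known label.
		LabelSelector: discoveryv1.LabelServiceName + "=" + svc,
	})
	if err != nil {
		return nil, err
	}

	var ips []string
	for _, slice := range slices.Items {
		for _, ep := range slice.Endpoints {
			// Skip endpoints that are no longer serving at all; keep the rest.
			if ep.Conditions.Serving != nil && !*ep.Conditions.Serving {
				continue
			}
			ips = append(ips, ep.Addresses...)
		}
	}
	return ips, nil
}

func main() {
	// Out-of-cluster config for a quick manual test; an operator would use the
	// in-cluster config and an informer instead of a one-shot List.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ips, err := endpointIPs(context.Background(), client, "default", "media-plane")
	if err != nil {
		panic(err)
	}
	fmt.Println(ips) // e.g. [10.244.0.12 10.244.0.13 10.244.0.14]
}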
Addendum: with two "ready" pods and one "terminating" pod:
media-plane-55658cb4f5-hdw6c 1/1 Running 0 10.244.0.14
media-plane-55658cb4f5-pjp9c 1/1 Terminating 0 10.244.0.12
media-plane-55658cb4f5-vjvnz 1/1 Running 0 10.244.0.13
We get this EndpointSlice:
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: media-plane-qzbrt
  namespace: default
  labels:
    app: media-plane
    kubernetes.io/service-name: media-plane
ports:
  - name: ""
    port: 9001
    protocol: UDP
addressType: IPv4
endpoints:
  - addresses:
      - 10.244.0.12
    conditions:
      ready: false
      serving: true
      terminating: true
    nodeName: stunner
    targetRef:
      kind: Pod
      name: media-plane-55658cb4f5-pjp9c
      namespace: default
      uid: 277d19b4-f5ab-4616-8024-4deebca8f7e9
  - addresses:
      - 10.244.0.13
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: stunner
    targetRef:
      kind: Pod
      name: media-plane-55658cb4f5-vjvnz
      namespace: default
      uid: 965ae9b0-13c8-4145-a850-533b857876af
  - addresses:
      - 10.244.0.14
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: stunner
    targetRef:
      kind: Pod
      name: media-plane-55658cb4f5-hdw6c
      namespace: default
      uid: 02a3e3e2-b83c-442c-afe4-4a7ad647bf19
I've tested this and the issue can no longer be reproduced with the new STUNner dev version that uses the EndpointSlice controller.
Fire up the UDP greeter example again but set terminationGracePeriodSeconds: 300 in the media-plane deployment.
Create a turncat tunnel and a TURN allocation:
export IPERF_ADDR=$(kubectl get pod -l app=media-plane -o jsonpath="{.items[0].status.podIP}")
turncat --log=all:TRACE - 'k8s://stunner/udp-gateway:udp-listener' udp://$IPERF_ADDR:9001
Hi
Greetings from STUNner!
...
Scale the media-plane deployment down to 0 pods:
kubectl scale deployment media-plane --replicas=0
This will trigger the UDP greeter pod to enter the Terminating state:
kubectl get pods
NAME READY STATUS RESTARTS AGE
media-plane-55658cb4f5-d7h2l 1/1 Terminating 0 91s
And the turncat tunnel stays open:
...
Hi again after terminate
Greetings from STUNner!
And the connection remains open, isn't it?
Greetings from STUNner!
...
I am facing the same issue using the latest dev image (stunnerd:dev). I am using STUNner in AWS with Netbird to manage a network mesh. When I delete one of the pods there is some downtime: I am running a simple ping test, and some connections are lost.
Would it be possible to add a preStop lifecycle hook to the gateway pods created via the Dataplane template?
apiVersion: stunner.l7mp.io/v1
kind: Dataplane
metadata:
  annotations:
    meta.helm.sh/release-name: stunner-gateway-operator
    meta.helm.sh/release-namespace: stunner-system
  labels:
    app.kubernetes.io/managed-by: Helm
  name: default
spec:
  args:
    - -w
    - --udp-thread-num=16
  command:
    - stunnerd
  enableMetricsEndpoint: false
  hostNetwork: false
  image: docker.io/l7mp/stunnerd:dev
  imagePullPolicy: Always
  replicas: 2
  resources:
    limits:
      cpu: 2
      memory: 512Mi
    requests:
      cpu: 500m
      memory: 128Mi
  terminationGracePeriodSeconds: 3600
-----
lifecycle:
  preStop:
    exec:
      command: ["/bin/busybox", "sleep", "5"]
-----
Not sure if this is the cause, but in EKS endpoints are not renewed fast enough and this causes downtime on services. I usually fix this by adding a preStop sleep of 5 seconds.
I am aware that the stunnerd image does not have a sleep command, but this is easy to fix.
Meanwhile, I will try Kyverno to mutate pods at creation time to check whether adding a lifecycle hook has any benefit.
I'm not a hundred percent sure I get the problem right. Do you terminate a backend pod and that makes your connections break, or do you terminate a stunnerd pod and that is what causes your connections to break?
My understanding is that it's case (2). Now, the question is: why are you killing your stunnerd pods? Are you scaling down a STUNner Gateway? If you're doing this right then this should never make STUNner disconnect any live connections: stunnerd implements the graceful shutdown pattern, so as long as there are live TURN allocations it should just keep running, at least for an hour as per the terminationGracePeriodSeconds: 3600 setting. If stunnerd goes away before that and this breaks active connections, then that's a bug. If you're force-killing stunnerd for another reason, then we'd like to know why before adding support for your use case.
NB: In retrospect it was a poor decision to limit the Dataplane spec to a few semi-random params we thought were important at the time. We're spending too much time adding stuff back from the Deployment spec to support specific use cases. We should have supported the full Deployment spec as is and called it a day...
Thanks for the fast response. I terminate the stunnerd pod. I have 3 replicas of the UDP gateway running to be in HA mode. To simulate a k8s node downscale or pod rescheduling, I manually run kubectl delete pod on one of the running replicas. I do not execute a force delete, just a standard kubectl delete pod.
I was able to add a mutation policy that adds a preStop lifecycle hook:
spec:
  containers:
    - args:
        - -w
        - --udp-thread-num=16
      command:
        - stunnerd
      env:
        - name: STUNNER_ADDR
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: STUNNER_NAME
          value: udp-gateway
        - name: STUNNER_NAMESPACE
          value: stunner
        - name: STUNNER_CONFIG_ORIGIN
          value: http://10.120.149.224:13478
      image: marcportabellaclotet/stunner-sleep:latest
      imagePullPolicy: Always
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/busybox
              - sleep
              - "5"
After adding the lifecycle hook, I am able to kubectl delete the pod and the connection does not break.
Ping output without the lifecycle hook:
64 bytes from 192.168.130.100: icmp_seq=921 ttl=63 time=202 ms
64 bytes from 192.168.130.100: icmp_seq=922 ttl=63 time=110 ms
From 192.168.8.1 icmp_seq=930 Destination Host Unreachable
From 192.168.8.1 icmp_seq=931 Destination Host Unreachable
From 192.168.8.1 icmp_seq=933 Destination Host Unreachable
From 192.168.8.1 icmp_seq=934 Destination Host Unreachable
64 bytes from 192.168.130.100: icmp_seq=935 ttl=63 time=4656 ms
64 bytes from 192.168.130.100: icmp_seq=936 ttl=63 time=3651 ms
Adding the 5-second sleep avoids the downtime. Note that I am using an AWS NLB, which may take a little time to reconfigure the endpoints.
This is the config I am using:
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: udp-gateway
  namespace: stunner
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb" # For NLB
    service.beta.kubernetes.io/aws-load-balancer-subnets: "subnet-yyy,subnet-xxx,subnet-zzz" # Replace with your subnet IDs
    service.beta.kubernetes.io/aws-load-balancer-internal: "false" # Optional: For internal load balancer
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: HTTP
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/live"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "8086"
spec:
  gatewayClassName: stunner-gatewayclass
  listeners:
    - name: udp-listener
      port: 3478
      protocol: TURN-UDP
---
apiVersion: stunner.l7mp.io/v1
kind: UDPRoute
metadata:
  name: stunner-headless
  namespace: stunner
spec:
  parentRefs:
    - name: udp-gateway
  rules:
    - backendRefs:
        - name: udp-gateway
          namespace: stunner
I'm afraid you're using graceful shutdown incorrectly. The idea is that as long as there are active TURN allocations, STUNner will refuse to shut down even if the stunnerd pod is deleted (at least until the terminationGracePeriod passes). If, however, there are no active allocations, there is no harm in removing a stunnerd pod, so it will terminate immediately on being deleted. Therefore graceful shutdown cannot be tested with ping: you have to actually create a TURN allocation (say, using the iperf test) and then kill the offending stunnerd pod. You should see your connection remaining open despite the stunnerd pod entering the Terminating state. In this case, there's no reason to add a preStop hook. If this is not what's happening, please file an issue, because that's a serious bug we have to investigate and fix.
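For reference, the pattern being described is roughly the following. This is a minimal sketch assuming a hypothetical activeAllocations() helper that reports the live TURN allocation count; it is not stunnerd's actual shutdown code.

// Minimal sketch of the graceful shutdown pattern: after SIGTERM, keep serving
// until there are no live TURN allocations left. Kubernetes enforces the upper
// bound via terminationGracePeriodSeconds, after which the pod is SIGKILLed.
package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
	"time"
)

// activeAllocations is a stand-in for querying the TURN server's live
// allocation count; a real server would get this from its TURN stack.
func activeAllocations() int { return 0 }

func main() {
	// Wait for SIGTERM, which Kubernetes sends when the pod is deleted.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()
	<-ctx.Done()

	// Graceful shutdown: new and existing allocations keep working while we
	// wait; we only exit once the last allocation is gone.
	for activeAllocations() > 0 {
		log.Printf("draining: %d allocation(s) still alive", activeAllocations())
		time.Sleep(time.Second)
	}
	log.Println("no live allocations left, exiting")
}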
I believe the core issue isn't that the current connection is broken. The problem arises during traffic redirection, where there's a race condition between the removal of endpoint IPs and the traffic being sent to them. In our scenario, which affects not just STUNner but other Kubernetes services as well, the Network Load Balancer (NLB) attempts to direct traffic to pods that are in the process of gracefully shutting down. As a result, for a few seconds, new traffic is directed to pods that have already received the SIGTERM signal and are no longer accepting new requests.
Now we're talking!...:-) Interesting. Is this AWS/EKS?
Anyway, by design STUNner does accept new connections during graceful shutdown, exactly to address buggy/slow load balancers. This makes the preStop hook unnecessary. Do you have actual tests where this fails?
Thanks for looking into this. I am not able to reproduce the issue any more. Even when gracefully deleting all the running stunnerd pods, the connection remains stable.
We're so grateful you tested this! Graceful shutdown is one of the most critical features cloud-native software must provide, but getting it right, especially for gateways, is quite difficult. STUNner's graceful shutdown codepath was broken for a long time and I'm so happy we could finally restore this useful feature before v1!
We will perform some intensive testing in the next few days and I will share feedback. Thanks!
Description
Clients' TURN connections should remain alive after the backend pod they are talking to via STUNner enters graceful shutdown. However, currently connections break as soon as the backend pod status becomes Terminating.

Steps to Reproduce
Fire up the UDP greeter example and add a sleep to the preStop hook to simulate graceful shutdown:
Create a client connection:
Kill the udp-greeter pod but make sure to keep turncat open:
Sending anything on the turncat connection silently fails.
Expected behavior: The turncat connection should remain functional until the backend pod finishes graceful shutdown.
Actual behavior: The connection breaks and the stunnerd logs show the below:

Versions
Latest master. The regression must have appeared during the stunnerd media routing rewrite.