kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Conntrack tables having stale entries for UDP connection #125467

Closed mohideenibrahim08 closed 2 months ago

mohideenibrahim08 commented 3 months ago

What happened?

We experienced an EC2 node failure in our EKS cluster. The affected node was running two CoreDNS pods, which handle DNS resolution in our Kubernetes cluster, and Envoy talks to CoreDNS over UDP. After these CoreDNS pods were terminated, Envoy continued attempting connections to the terminated pod's IP for DNS resolution: kube-proxy failed to update the corresponding conntrack entries, so some Envoy pods kept sending queries to the terminated CoreDNS pod IP. Once we restarted the Envoy pods, the entries were refreshed and the DNS timeout issue was resolved.
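
One way to see the mismatch is to compare the live CoreDNS endpoints with what conntrack still holds for the client pod (a hedged diagnostic sketch; `kube-dns`/`kube-system` are the default EKS service and namespace names):

```console
# Live CoreDNS endpoints currently backing the cluster DNS Service
$ kubectl -n kube-system get endpoints kube-dns -o wide

# Compare with the destination IP that conntrack still holds for the client pod (shown below)
```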

Conntrack table mapping for source pod IP 10.103.83.53 (UDP protocol):

```console
$ conntrack -p udp -L --src 10.103.83.53
udp 17 27 src=10.103.83.53 dst= sport=21667 dport=53 [UNREPLIED] src=10.103.78.37 dst=10.103.83.53 sport=53 dport=21667 mark=0 use=1
conntrack v1.4.4 (conntrack-tools): 1 flow entries have been shown
```

What did you expect to happen?

kube-proxy should update or refresh the conntrack table; conntrack shouldn't keep stale UDP entries pointing at terminated endpoints.

How can we reproduce it (as minimally and precisely as possible)?

The kube-proxy version we tested with is kube-proxy:v1.29.4-minimal-eksbuild.1, which already includes the fix from https://github.com/kubernetes/kubernetes/issues/119249.

Steps we followed in our EKS cluster to simulate this issue:

Anything else we need to know?

Kubernetes version

```console
$ kubectl version
Server Version: v1.29.4-eks-036c24b
```

Cloud provider

AWS

OS version

```console
# On Linux:
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"

$ uname -a
Linux ip-10-185-97-105.ec2.internal 5.10.215-203.850.amzn2.aarch64 #1 SMP Tue Apr 23 20:32:21 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
```

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

samof76 commented 3 months ago

/sig network

samof76 commented 3 months ago

/area kube-proxy

alexku7 commented 3 months ago

We have the same issue, but with an nginx proxy.

Once the CoreDNS pod restarted, the nginx proxy still tried to resolve names via the old CoreDNS IP.

aojea commented 3 months ago

The reproducer seems very invasive. Either way, you need to provide logs and the timing of the events; run kube-proxy with -v4, for example.

shaneutt commented 3 months ago

/assign @aojea

shaneutt commented 3 months ago

We talked about this in the SIG Network meeting today, this may relate https://github.com/kubernetes/kubernetes/issues/112604

andrewtagg-db commented 3 months ago

We also experienced a similar issue with Envoy in AKS with the kube-proxy image mcr.microsoft.com/oss/kubernetes/kube-proxy:v1.28.5-hotfix.20240411. To validate the cause we deleted the stale conntrack entries, and traffic to the DNS service began working again. To mitigate until there is a fix, we are reconfiguring Envoy to use TCP for DNS requests.
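
For reference, a minimal sketch of that manual mitigation with the conntrack-tools CLI used elsewhere in this thread (the ClusterIP 172.20.0.10 is just an example of a cluster DNS Service IP):

```console
# List never-answered UDP entries toward the cluster DNS ClusterIP
$ conntrack -L -p udp --dst 172.20.0.10 | grep UNREPLIED

# Delete them so the next query is DNAT'ed through kube-proxy's current rules
$ conntrack -D -p udp --orig-dst 172.20.0.10
```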

samof76 commented 3 months ago

@shaneutt what was decided in the meeting?

aojea commented 3 months ago

https://github.com/kubernetes/kubernetes/issues/125467#issuecomment-2179333188

That we need to investigate it and find the root cause; we need kube-proxy logs identifying the problematic IP that leaves stale entries.

alexku7 commented 3 months ago

We were able to reproduce the issue consistently with `kill -STOP <coreDNS pid> <kubeproxy pid>`; after that, restart the kube-dns and kube-proxy pods normally and the problem occurs.
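
A rough sketch of that reproduction on the node (the pgrep patterns are assumptions and depend on how the processes appear on your nodes):

```console
# Freeze CoreDNS and kube-proxy to simulate a hung/overloaded node
$ kill -STOP $(pgrep coredns) $(pgrep kube-proxy)

# Then restart the CoreDNS and kube-proxy pods normally; clients that keep
# retrying DNS from the same UDP source port stay pinned to the old pod IP
```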

It looks like the problem affects only processes that use UDP and the same source port for their DNS queries. They also need to keep retrying the queries from that same source port, which keeps refreshing the entry and prevents it from reaching the 120-second timeout. If Envoy/nginx or any other tool stops retrying the resolution, the conntrack table is updated after 120 seconds.

The parameter that affects this timeout is nf_conntrack_udp_timeout_stream.
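
For reference, a minimal sketch of how to inspect (and, if you accept the node-wide side effects, tune) that timeout via sysctl; the exact default depends on the kernel:

```console
# Show the stream timeout for established UDP conntrack entries (seconds)
$ sysctl net.netfilter.nf_conntrack_udp_timeout_stream

# Lowering it makes such entries age out sooner (affects all UDP flows on the node)
$ sysctl -w net.netfilter.nf_conntrack_udp_timeout_stream=60
```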

A simple Python script for testing; it uses UDP and a fixed source port:

```python
import socket
import struct
import random
import time

def create_dns_query(domain):
    transaction_id = random.randint(0, 65535)
    flags = 0x0100  # Standard query with recursion desired
    header = struct.pack('!HHHHHH', transaction_id, flags, 1, 0, 0, 0)

    question = b''
    for part in domain.split('.'):
        question += struct.pack('B', len(part)) + part.encode()
    question += b'\x00'  # Terminating null byte

    question += struct.pack('!HH', 1, 1)  # QTYPE (A record) and QCLASS (IN)

    return header + question

def send_dns_query(sock, domain, dns_server="172.20.0.10"):
    try:
        query = create_dns_query(domain)
        sock.sendto(query, (dns_server, 53))

        sock.settimeout(2)  # Set a timeout for receiving the response
        response, _ = sock.recvfrom(1024)

        flags = struct.unpack('!H', response[2:4])[0]
        rcode = flags & 0xF

        ancount = struct.unpack('!H', response[6:8])[0]

        print(f"DNS Response for {domain}:")
        print(f"Response Code: {rcode}")
        print("Answer Section:")

        offset = 12
        while response[offset] != 0:
            offset += 1
        offset += 5

        for _ in range(ancount):
            if (response[offset] & 0xC0) == 0xC0:
                offset += 2
            else:
                while response[offset] != 0:
                    offset += 1
                offset += 1

            rec_type, rec_class, ttl, data_len = struct.unpack('!HHIH', response[offset:offset+10])
            offset += 10

            if rec_type == 1:  # A record
                ip = '.'.join(map(str, response[offset:offset+4]))
                print(f"{domain} {ttl} IN A {ip}")

            offset += data_len

    except socket.timeout:
        print(f"Error: DNS query timed out for {domain}")
    except Exception as e:
        print(f"Error occurred while querying {domain}: {str(e)}")

def continuous_dns_resolution(domain, source_port=12345, interval=1):
    print(f"Starting continuous DNS resolution for {domain} every {interval} second(s)")
    print("Press Ctrl+C to stop the script")

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind(('', source_port))

        while True:
            send_dns_query(sock, domain)
            print("\n")  # Add a newline for better readability between queries
            time.sleep(interval)
    except KeyboardInterrupt:
        print("\nScript terminated by user")
    finally:
        sock.close()

# Example usage
if __name__ == "__main__":
    domain_to_query = "example.com"
    continuous_dns_resolution(domain_to_query)
```
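
A possible way to exercise it (the filename dns_repro.py is a placeholder, and 172.20.0.10 is the script's default DNS server IP): run the script from a test pod, restart CoreDNS, and watch for conntrack entries that stay stuck in [UNREPLIED]:

```console
# In a test pod that resolves through the cluster DNS Service
$ python3 dns_repro.py

# On the node, watch for entries that never get a reply
$ watch -n1 'conntrack -L -p udp --dst 172.20.0.10 | grep UNREPLIED'
```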

To summarize:

Under some conditions, kube-proxy doesn't update the conntrack table if UDP and the same source port are used.

aojea commented 3 months ago

This is a duplicate of https://github.com/kubernetes/kubernetes/issues/122740

> We succeeded to reproduce the issue constantly by: kill -STOP <coreDNS pid> <kubeproxy pid>; after that you can restart the kube dns and proxy pod normally and the problem occurs.

Though this is really something users should take into account when doing rolling updates, and it can be mitigated by following best practices... It is not common to start killing all the pods on a node; you should first drain the node: https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
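
For completeness, a minimal sketch of that best practice (the node name is a placeholder):

```console
# Cordon the node and evict its pods gracefully before taking it down
$ kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# When the node is healthy again, allow scheduling on it
$ kubectl uncordon <node-name>
```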

samof76 commented 3 months ago

@aojea agree with your suggestion on draining nodes safely, but in cases where the node fails, which happened on this issue log, it is the kube-proxy that should respond to such failure, dropping and recreating the conntrack entries appropriately.

aojea commented 3 months ago

> @aojea agree with your suggestion on draining nodes safely, but in cases where the node fails, which happened on this issue log, it is the kube-proxy that should respond to such failure, dropping and recreating the conntrack entries appropriately.

The bug is legit, but I think it is hard to hit; what happened in this issue description is that the user is manually forcing a scenario that is known to fail and is documented in #122740.

alexku7 commented 3 months ago

> @aojea agree with your suggestion on draining nodes safely, but in cases where the node fails, which happened on this issue log, it is the kube-proxy that should respond to such failure, dropping and recreating the conntrack entries appropriately.

> the bug is legit, but I think that is hard to hit, what happened in this issue description is that the user is manually forcing this scenarios that is known to fail and documented in #122740

In our case the node went into a NotReady state because it was overloaded, so kube-proxy was stuck for some time and CoreDNS was evicted.

After a while the node recovered, but the entire cluster was left almost non-functional because of the DNS issue, so we had to restart the many pods affected by this UDP bug.

aojea commented 3 months ago

> In our case the node went to not ready state because of the overload. So the kubeproxy has stuck for some time.

this is interesting ... why kube-proxy got stuck?

alexku7 commented 3 months ago

> In our case the node went to not ready state because of the overload. So the kubeproxy has stuck for some time.

> this is interesting ... why kube-proxy got stuck?

The whole node got stuck because of memory overload.

It's a bit complicated to reproduce in the lab, but it happened twice in two different EKS clusters. AWS support recommended giving more memory to kubeReserved as a mitigation.
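
For reference, a hedged sketch of that mitigation (the values are placeholders; on EKS this is usually set through the node bootstrap or kubelet configuration rather than edited by hand):

```console
# kubelet flag (or the equivalent kubeReserved field in the KubeletConfiguration file),
# reserving resources for system daemons so workload memory pressure is less likely to starve them
--kube-reserved=cpu=250m,memory=1Gi
```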

aojea commented 3 months ago

No worries. I think what you are describing is what Dan Winship describes here: https://github.com/kubernetes/kubernetes/issues/112604

/triage accepted

balu-ce commented 3 months ago

> you need to provide logs and timing of the events, run kube-proxy with -v4 per example

```
I0625 13:13:20.351149       1 proxier.go:796] "Syncing iptables rules"
I0625 13:13:20.351454       1 iptables.go:358] "Running" command="iptables-save" arguments=["-t","nat"]
I0625 13:13:20.355238       1 proxier.go:1504] "Reloading service iptables data" numServices=40 numEndpoints=52 numFilterChains=6 numFilterRules=8 numNATChains=8 numNATRules=45
I0625 13:13:20.355259       1 iptables.go:423] "Running" command="iptables-restore" arguments=["-w","5","-W","100000","--noflush","--counters"]
I0625 13:13:20.359194       1 proxier.go:1533] "Network programming" endpoint="kube-system/kube-dns" elapsed=0.35915983
I0625 13:13:20.359258       1 cleanup.go:63] "Deleting conntrack stale entries for services" IPs=[]
I0625 13:13:20.359280       1 cleanup.go:69] "Deleting conntrack stale entries for services" nodePorts=[]
I0625 13:13:20.359306       1 conntrack.go:66] "Clearing conntrack entries" parameters=["-D","--orig-dst","172.20.0.10","--dst-nat","10.240.244.119","-p","udp"]
I0625 13:13:20.361710       1 conntrack.go:71] "Conntrack entries deleted" output=<
    conntrack v1.4.4 (conntrack-tools): 17 flow entries have been deleted.
    udp      17 19 src=10.240.244.140 dst=172.20.0.10 sport=55850 dport=53 [UNREPLIED] src=10.240.244.119 dst=10.240.244.140 sport=53 dport=55850 mark=0 use=1
    udp      17 3 src=10.240.244.34 dst=172.20.0.10 sport=59350 dport=53 src=10.240.244.119 dst=10.240.244.34 sport=53 dport=59350 mark=0 use=1
    udp      17 3 src=10.240.244.34 dst=172.20.0.10 sport=59716 dport=53 src=10.240.244.119 dst=10.240.244.34 sport=53 dport=59716 mark=0 use=1
    udp      17 19 src=10.240.244.140 dst=172.20.0.10 sport=42887 dport=53 [UNREPLIED] src=10.240.244.119 dst=10.240.244.140 sport=53 dport=42887 mark=0 use=1
    udp      17 19 src=10.240.244.140 dst=172.20.0.10 sport=41925 dport=53 [UNREPLIED] src=10.240.244.119 dst=10.240.244.140 sport=53 dport=41925 mark=0 use=1
    udp      17 19 src=10.240.244.140 dst=172.20.0.10 sport=40972 dport=53 [UNREPLIED] src=10.240.244.119 dst=10.240.244.140 sport=53 dport=40972 mark=0 use=1
    udp      17 3 src=10.240.244.34 dst=172.20.0.10 sport=51492 dport=53 src=10.240.244.119 dst=10.240.244.34 sport=53 dport=51492 mark=0 use=2
    udp      17 16 src=10.240.244.58 dst=172.20.0.10 sport=55964 dport=53 [UNREPLIED] src=10.240.244.119 dst=10.240.244.58 sport=53 dport=55964 mark=0 use=1
    udp      17 3 src=10.240.244.34 dst=172.20.0.10 sport=48529 dport=53 src=10.240.244.119 dst=10.240.244.34 sport=53 dport=48529 mark=0 use=1
    udp      17 3 src=10.240.244.34 dst=172.20.0.10 sport=43667 dport=53 src=10.240.244.119 dst=10.240.244.34 sport=53 dport=43667 mark=0 use=1
    udp      17 19 src=10.240.244.140 dst=172.20.0.10 sport=51934 dport=53 [UNREPLIED] src=10.240.244.119 dst=10.240.244.140 sport=53 dport=51934 mark=0 use=1
    udp      17 5 src=10.240.244.154 dst=172.20.0.10 sport=35714 dport=53 src=10.240.244.119 dst=10.240.244.154 sport=53 dport=35714 mark=0 use=1
    udp      17 19 src=10.240.244.140 dst=172.20.0.10 sport=58990 dport=53 [UNREPLIED] src=10.240.244.119 dst=10.240.244.140 sport=53 dport=58990 mark=0 use=1
    udp      17 19 src=10.240.244.140 dst=172.20.0.10 sport=40587 dport=53 [UNREPLIED] src=10.240.244.119 dst=10.240.244.140 sport=53 dport=40587 mark=0 use=1
    udp      17 19 src=10.240.244.140 dst=172.20.0.10 sport=48436 dport=53 [UNREPLIED] src=10.240.244.119 dst=10.240.244.140 sport=53 dport=48436 mark=0 use=1
    udp      17 23 src=10.240.244.34 dst=172.20.0.10 sport=40050 dport=53 [UNREPLIED] src=10.240.244.119 dst=10.240.244.34 sport=53 dport=40050 mark=0 use=1
    udp      17 16 src=10.240.244.191 dst=172.20.0.10 sport=40033 dport=53 [UNREPLIED] src=10.240.244.119 dst=10.240.244.191 sport=53 dport=40033 mark=0 use=1
```

@aojea if you need anything more, please let us know.

aojea commented 3 months ago

/assign

thanks,

andrewtagg-db commented 2 months ago

Reviewing this issue since we still see impact occasionally; in the most recent occurrence we found and deleted this conntrack entry for a destination service with ClusterIP 192.168.0.10:

```console
$ conntrack -L | grep 192.168.0.10 | grep UNREPLIED
udp      17 29 src=10.120.1.150 dst=192.168.0.10 sport=49660 dport=53 [UNREPLIED] src=192.168.0.10 dst=10.120.1.150 sport=53 dport=49660 mark=0 use=1
```

It looks like DNAT wasn't set up for this flow. After deleting the entry, the next UDP request succeeded and requests to the ClusterIP began working again.

Reviewing some related issues, I found the following PR which seems like it would resolve our case: https://github.com/kubernetes/kubernetes/pull/122741 (related to https://github.com/kubernetes/kubernetes/issues/122740).

aojea commented 2 months ago

/close

let's centralize everything here https://github.com/kubernetes/kubernetes/issues/126130

k8s-ci-robot commented 2 months ago

@aojea: Closing this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/125467#issuecomment-2237057275):

> /close
>
> let's centralize everything here https://github.com/kubernetes/kubernetes/issues/126130

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

danwinship commented 5 days ago

FTR this should now be fixed in master (separately from the larger conntrack reconciler work); #127808 is a cherry-pick to release-1.29