kubernetes-sigs / kind

Kubernetes IN Docker - local clusters for testing Kubernetes
https://kind.sigs.k8s.io/
Apache License 2.0

DaemonSet pods scheduled on one node in my 4-worker-node cluster do not get proper networking. #3602

Closed bittrance closed 2 months ago

bittrance commented 2 months ago

What happened:

TL;DR: DaemonSet pods scheduled on one node in my 4-worker-node IPVS kind cluster do not get proper networking. Anecdotal evidence suggests that the same problem occurs with iptables-based networking. I have not tried pods created from deployments or manually. Smaller clusters do not seem to have this problem.

The original error that led me down this rabbit hole came from working with spegel, which uses the coordination API. Three nodes always joined up properly, while the fourth failed with:

Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/spegel/leases/spegel-leader-election": dial tcp 10.96.0.1:443: connect: no route to host

I created a test case which reproduces this issue with a relatively simple setup. This case too passes when the cluster has 3 worker nodes, but fails when it gets 4 worker nodes. In the test case, pods fail on name resolution, see below. The same node fails its pods in all DaemonSets launched into the cluster. Creating a 5-worker-node cluster results in two bad nodes.

What you expected to happen:

I expected to get working networking on all my nodes, but at the same time this issue seems too obvious to have remained undiscovered?

How to reproduce it (as minimally and precisely as possible):

This repro uses apt-get update, as it is present in a well-used image. Pods on healthy nodes will happily update their package metadata, but one will fail on name resolution.

$ cat <<EOF > ./config.yaml
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
networking:
  kubeProxyMode: "ipvs"
nodes:
  - role: control-plane
  - role: worker
  - role: worker
  - role: worker
  - role: worker
EOF
$ kind create cluster --config ./config.yaml
Creating cluster "kind" ...
 βœ“ Ensuring node image (kindest/node:v1.29.2) πŸ–Ό
 βœ“ Preparing nodes πŸ“¦ πŸ“¦ πŸ“¦ πŸ“¦ πŸ“¦
 βœ“ Writing configuration πŸ“œ
 βœ“ Starting control-plane πŸ•Ή
 βœ“ Installing CNI πŸ”Œ
 βœ“ Installing StorageClass πŸ’Ύ
 βœ“ Joining worker nodes 🚜
Set kubectl context to "kind-kind"

$ cat <<EOF > ./test.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: test
spec:
  selector:
    matchLabels:
      name: test
  template:
    metadata:
      labels:
        name: test
    spec:
      containers:
      - name: test
        image: ubuntu:latest
        command:
        - /bin/bash
        - -c
        - "apt-get update && sleep 100000"
EOF
$ kubectl create namespace test-1
$ kubectl create --namespace test-1 --filename ./test.yaml
$ kubectl logs --namespace test-1 --selector name=test --prefix | grep failure
[pod/test-5vznr/test]   Temporary failure resolving 'archive.ubuntu.com'
[pod/test-5vznr/test]   Temporary failure resolving 'archive.ubuntu.com'

$ kubectl describe pods --namespace test-1 test-5vznr | grep Node:
Node:             kind-worker2/172.19.0.5

Creating and populating additional namespaces demonstrates that all failing pods end up on the same node, in this case kind-worker2.
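
For instance, a second namespace populated with the same DaemonSet should (at least in my environment) put its failing pod on the same node again; the namespace name test-2 is arbitrary:

$ kubectl create namespace test-2
$ kubectl create --namespace test-2 --filename ./test.yaml
$ kubectl get pods --namespace test-2 --output wide

The pod that fails name resolution is again scheduled on kind-worker2.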

Environment:

$ docker info
Client: Docker Engine - Community
 Version:    26.0.2
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.14.0
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.26.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 40
  Running: 5
  Paused: 0
  Stopped: 35
 Images: 533
 Server Version: 26.0.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: active
  NodeID: y9pvqj4lb8hj4gq533lf8gdsd
  Is Manager: true
  ClusterID: mjd5fxlzt8yxgw1rusd8uabsz
  Managers: 1
  Nodes: 1
  Default Address Pool: 10.0.0.0/8
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 192.168.1.190
  Manager Addresses:
   192.168.1.190:2377
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: e377cd56a71523140ca6ae87e30244719194a521
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.5.0-28-generic
 Operating System: Ubuntu 22.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.05GiB
 Name: bittrance
 ID: 17cef58c-eed0-4c17-b15f-5a9dac4bb848
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

A node with bad networking has this description:

Name:               kind-worker
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=kind-worker
                    kubernetes.io/os=linux
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 03 May 2024 22:46:37 +0200
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  kind-worker
  AcquireTime:     <unset>
  RenewTime:       Fri, 03 May 2024 22:54:49 +0200
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Fri, 03 May 2024 22:53:47 +0200   Fri, 03 May 2024 22:46:37 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 03 May 2024 22:53:47 +0200   Fri, 03 May 2024 22:46:37 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Fri, 03 May 2024 22:53:47 +0200   Fri, 03 May 2024 22:46:37 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Fri, 03 May 2024 22:53:47 +0200   Fri, 03 May 2024 22:46:41 +0200   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  172.19.0.2
  Hostname:    kind-worker
Capacity:
  cpu:                8
  ephemeral-storage:  959122528Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32559536Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  959122528Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32559536Ki
  pods:               110
System Info:
  Machine ID:                 240e59c9a7074bcaa24ad35ec3869116
  System UUID:                b2e313d3-4090-412c-9db2-ac5709837954
  Boot ID:                    69ec0e87-fcd4-4eb7-b752-51e7b32364b0
  Kernel Version:             6.5.0-28-generic
  OS Image:                   Debian GNU/Linux 12 (bookworm)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.13
  Kubelet Version:            v1.29.2
  Kube-Proxy Version:         v1.29.2
PodCIDR:                      10.244.4.0/24
PodCIDRs:                     10.244.4.0/24
ProviderID:                   kind://docker/kind/kind-worker
Non-terminated Pods:          (3 in total)
  Namespace                   Name                CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                ------------  ----------  ---------------  -------------  ---
  kube-system                 kindnet-hkltj       100m (1%)     100m (1%)   50Mi (0%)        50Mi (0%)      8m18s
  kube-system                 kube-proxy-hx8pj    0 (0%)        0 (0%)      0 (0%)           0 (0%)         8m18s
  test-1                      test-mjnjv          0 (0%)        0 (0%)      0 (0%)           0 (0%)         6m43s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                100m (1%)  100m (1%)
  memory             50Mi (0%)  50Mi (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
Events:
  Type    Reason                   Age                    From             Message
  ----    ------                   ----                   ----             -------
  Normal  NodeHasSufficientMemory  8m18s (x2 over 8m18s)  kubelet          Node kind-worker status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    8m18s (x2 over 8m18s)  kubelet          Node kind-worker status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     8m18s (x2 over 8m18s)  kubelet          Node kind-worker status is now: NodeHasSufficientPID
  Normal  Starting                 8m16s                  kubelet          Starting kubelet.
  Normal  NodeAllocatableEnforced  8m16s                  kubelet          Updated Node Allocatable limit across pods
  Normal  NodeHasSufficientMemory  8m16s                  kubelet          Node kind-worker status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    8m16s                  kubelet          Node kind-worker status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     8m16s                  kubelet          Node kind-worker status is now: NodeHasSufficientPID
  Normal  RegisteredNode           8m14s                  node-controller  Node kind-worker event: Registered Node kind-worker in Controller
  Normal  NodeReady                8m14s                  kubelet          Node kind-worker status is now: NodeReady
liangyuanpeng commented 2 months ago

I have followed your steps and could not reproduce it. Maybe spegel has changed something; I'm not sure about that, as I have not tried spegel. @bittrance

$ kubectl get node
NAME                 STATUS   ROLES           AGE     VERSION
kind-control-plane   Ready    control-plane   2m33s   v1.29.2
kind-worker          Ready    <none>          2m10s   v1.29.2
kind-worker2         Ready    <none>          2m10s   v1.29.2
kind-worker3         Ready    <none>          2m11s   v1.29.2
kind-worker4         Ready    <none>          2m11s   v1.29.2
$ kubectl get pod
No resources found in default namespace.
$ kubectl get pod -n test-1
NAME         READY   STATUS    RESTARTS   AGE
test-9jpd5   1/1     Running   0          29s
test-nrbzg   1/1     Running   0          29s
test-pmxcw   1/1     Running   0          29s
test-z2zsb   1/1     Running   0          29s
bittrance commented 2 months ago

Ah, mystery solved.

$ kubectl get pods -A | grep -v Running
NAMESPACE            NAME                                         READY   STATUS             RESTARTS        AGE
kube-system          kube-proxy-nq9h6                             0/1     CrashLoopBackOff   17 (5m6s ago)   66m
$ kubectl logs --namespace kube-system kube-proxy-nq9h6
E0506 17:32:51.755163       1 run.go:74] "command failed" err="failed complete: too many open files"

I had expected Kind to tell me things like that when it creates the cluster, but I suppose it is hard to see "inside" the node.

BenTheElder commented 2 months ago

Yeah, it's a pretty complicated space and Kubernetes is supposed to be eventually consistent. We need the control plane / API to come up, and we ensure that much ...

... but it's possible parts of the dataplane are still unhealthy; possibly this is even expected, e.g. when the user opts to disable the built-in CNI and needs to continue dataplane bootstrapping themselves by installing a CNI.
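
If you want to assert dataplane health yourself after creating a cluster, a rough sketch (only as good as the pods' readiness reporting) is to wait for the kube-system pods to become Ready, which would have caught the crash-looping kube-proxy here:

$ kubectl wait --namespace kube-system --for=condition=Ready pods --all --timeout=120s

A pod stuck in CrashLoopBackOff never reports Ready, so the wait times out instead of succeeding.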

BenTheElder commented 2 months ago

Glad you found it. This may be https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files
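
If that's the case, the workaround documented there is to raise the host's inotify limits, roughly:

$ sudo sysctl fs.inotify.max_user_watches=524288
$ sudo sysctl fs.inotify.max_user_instances=512

(and persist them in /etc/sysctl.conf if that fixes it).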