k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

k3s service doesn't work when restart network #9855

Closed: xiaosaxu closed this issue 3 months ago

xiaosaxu commented 6 months ago

Environmental Info: K3s Version:

(screenshot; not transcribed)

Node(s) CPU architecture, OS, and Version:

(screenshot; not transcribed)

Cluster Configuration:

(screenshot; not transcribed)

Describe the bug:

Steps To Reproduce:

brandond commented 6 months ago

Please attach the actual k3s service logs from journald. Can you confirm that your IP address is not changing when you restart the network, and that you don't have any other host-based firewall (ufw/firewalld/etc) that is being reloaded when you restart the network?
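For example, roughly along these lines (a sketch of the requested checks; the firewall unit names vary by distro):

# k3s service logs from journald around the network restart
journalctl -u k3s --no-pager --since "10 minutes ago"
# confirm the node address before and after the restart
ip -4 addr show
# common host-based firewalls; absent units just report unknown
systemctl is-active firewalld ufw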

xiaosaxu commented 6 months ago

> Please attach the actual k3s service logs from journald. Can you confirm that your IP address is not changing when you restart the network, and that you don't have any other host-based firewall (ufw/firewalld/etc) that is being reloaded when you restart the network?

Thank you for your reply. I have enabled debug logging for the k3s service. After restarting the network and calling the service, I see no abnormal log output:

Apr 07 11:02:39 linkone k3s[7137]: E0407 11:02:39.290369    7137 dns.go:158] "Nameserver limits exceeded" err="Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 8.8.8.8 114.114.114.114 114.114.114.114"
Apr 07 11:03:14 linkone k3s[7137]: time="2024-04-07T11:03:14+08:00" level=info msg="COMPACT compactRev=1281 targetCompactRev=2281 currentRev=5969"
Apr 07 11:03:14 linkone k3s[7137]: time="2024-04-07T11:03:14+08:00" level=info msg="COMPACT deleted 647 rows from 1000 revisions in 54.380113ms - compacted to 2281/5969"
Apr 07 11:03:14 linkone k3s[7137]: time="2024-04-07T11:03:14+08:00" level=info msg="COMPACT compactRev=2281 targetCompactRev=3281 currentRev=5969"
Apr 07 11:03:14 linkone k3s[7137]: time="2024-04-07T11:03:14+08:00" level=info msg="COMPACT deleted 536 rows from 1000 revisions in 62.583587ms - compacted to 3281/5969"
Apr 07 11:03:14 linkone k3s[7137]: time="2024-04-07T11:03:14+08:00" level=info msg="COMPACT compactRev=3281 targetCompactRev=4281 currentRev=5969"
Apr 07 11:03:15 linkone k3s[7137]: time="2024-04-07T11:03:15+08:00" level=info msg="COMPACT deleted 624 rows from 1000 revisions in 62.928533ms - compacted to 4281/5969"
Apr 07 11:03:15 linkone k3s[7137]: time="2024-04-07T11:03:15+08:00" level=info msg="COMPACT compactRev=4281 targetCompactRev=4969 currentRev=5969"
Apr 07 11:03:15 linkone k3s[7137]: time="2024-04-07T11:03:15+08:00" level=info msg="COMPACT deleted 492 rows from 688 revisions in 42.766839ms - compacted to 4969/5969"
Apr 07 11:03:15 linkone k3s[7137]: time="2024-04-07T11:03:15+08:00" level=info msg="COMPACT revision 4969 has already been compacted"
Apr 07 11:03:16 linkone k3s[7137]: I0407 11:03:16.311041    7137 handler.go:232] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
Apr 07 11:03:16 linkone k3s[7137]: I0407 11:03:16.325403    7137 handler.go:232] Adding GroupVersion k3s.cattle.io v1 to ResourceManager
Apr 07 11:03:16 linkone k3s[7137]: I0407 11:03:16.325498    7137 handler.go:232] Adding GroupVersion helm.cattle.io v1 to ResourceManager
Apr 07 11:03:16 linkone k3s[7137]: I0407 11:03:16.384965    7137 handler.go:232] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
Apr 07 11:03:59 linkone k3s[7137]: E0407 11:03:59.290718    7137 dns.go:158] "Nameserver limits exceeded" err="Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 8.8.8.8 114.114.114.114 114.114.114.114"
Apr 07 11:04:16 linkone k3s[7137]: I0407 11:04:16.308083    7137 handler.go:232] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
Apr 07 11:05:16 linkone k3s[7137]: I0407 11:05:16.307731    7137 handler.go:232] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
Apr 07 11:05:20 linkone k3s[7137]: E0407 11:05:20.290087    7137 dns.go:158] "Nameserver limits exceeded" err="Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 8.8.8.8 114.114.114.114 114.114.114.114"
Apr 07 11:06:16 linkone k3s[7137]: I0407 11:06:16.307185    7137 handler.go:232] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
Apr 07 11:06:21 linkone k3s[7137]: E0407 11:06:21.290959    7137 dns.go:158] "Nameserver limits exceeded" err="Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 8.8.8.8 114.114.114.114 114.114.114.114"
Apr 07 11:07:16 linkone k3s[7137]: I0407 11:07:16.307531    7137 handler.go:232] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
Apr 07 11:07:23 linkone k3s[7137]: E0407 11:07:23.290480    7137 dns.go:158] "Nameserver limits exceeded" err="Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 8.8.8.8 114.114.114.114 114.114.114.114"
Apr 07 11:08:14 linkone k3s[7137]: time="2024-04-07T11:08:14+08:00" level=info msg="COMPACT compactRev=4969 targetCompactRev=5092 currentRev=6092"
Apr 07 11:08:14 linkone k3s[7137]: time="2024-04-07T11:08:14+08:00" level=info msg="COMPACT deleted 76 rows from 123 revisions in 9.25687ms - compacted to 5092/6092"
Apr 07 11:08:16 linkone k3s[7137]: I0407 11:08:16.307744    7137 handler.go:232] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
Apr 07 11:08:16 linkone k3s[7137]: I0407 11:08:16.324931    7137 handler.go:232] Adding GroupVersion k3s.cattle.io v1 to ResourceManager
Apr 07 11:08:16 linkone k3s[7137]: I0407 11:08:16.325192    7137 handler.go:232] Adding GroupVersion helm.cattle.io v1 to ResourceManager
Apr 07 11:08:16 linkone k3s[7137]: I0407 11:08:16.370666    7137 handler.go:232] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
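An aside on the repeated "Nameserver limits exceeded" errors above: the kubelet applies at most three nameservers, and the applied line shows 114.114.114.114 duplicated. A minimal deduplication sketch, assuming the kubelet reads the file passed via --resolv-conf (shown in the service unit later in this thread):

# rewrite the resolv.conf the kubelet is pointed at, without duplicates
cat > /etc/rancher/k3s/resolv.conf <<'EOF'
nameserver 8.8.8.8
nameserver 114.114.114.114
EOF
# restarting k3s ensures new pods pick up the change
systemctl restart k3s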

The host's IP address remains unchanged after the restart, and I have also disabled the firewall service. I compared the output of ip link show before and after the network restart and found it identical.

What else can I investigate?

manuelbuil commented 6 months ago

I understand that after running systemctl restart network you can curl the kube-api clusterIP from the pod but not from the host, right? Can you curl the kube-api endpoint (not the clusterIP) from the host? You can see the kube-api endpoint by running kubectl get endpoints.
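A minimal sketch of that check (the endpoint address matches the kubectl get endpoints output shown later in this thread; even an HTTP 401 on the anonymous request proves the endpoint is reachable):

# the kube-api endpoint is the host address, not the clusterIP
kubectl get endpoints kubernetes
# substitute the ADDRESS:PORT returned above; -k skips cert verification
curl -k https://172.17.7.110:6443/version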

brandond commented 6 months ago

Can you show the specific curl requests you're making, and what the output is?

xiaosaxu commented 6 months ago

@brandond @manuelbuil Thank you very much for your responses. To better illustrate my problem, I have constructed a scenario: an nginx Deployment with a Service, defined as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nginx
    spec:
      containers:
      - image: infra/nginx:1.23.1-alpine
        imagePullPolicy: IfNotPresent
        name: nginx
        ports:
        - containerPort: 80
          protocol: TCP
      dnsPolicy: ClusterFirst
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 80

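The manifest can be applied and verified along these lines (nginx.yaml is a hypothetical filename for the manifest above):

kubectl apply -f nginx.yaml
kubectl get pod,service,endpoints -o wide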
The resources are created as shown below:

[root@linkone ~]# k get pod -owide
NAME                          READY   STATUS        RESTARTS       AGE   IP              NODE       NOMINATED NODE   READINESS GATES
nginx-fcc8c6678-gpcnj         1/1     Running       0              38m   113.122.0.237   uap-node   <none>           <none>
[root@linkone ~]#
[root@linkone ~]# k get service
NAME            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
kubernetes      ClusterIP   113.123.0.1      <none>        443/TCP    13d
nginx-service   ClusterIP   113.123.86.142   <none>        8080/TCP   35m
[root@linkone ~]# k get endpoints
NAME            ENDPOINTS           AGE
kubernetes      172.17.7.110:6443   13d
nginx-service   113.122.0.237:80    35m

Before restarting the network, all three methods succeed, both from the host and from within a pod (a pod that has the curl command, not the nginx pod):

curl 113.122.0.237:80 (by pod IP)
curl 113.123.86.142:8080 (by cluster IP)
curl nginx-service.default.svc.cluster.local:8080 (by service name)

But after restarting the network (with systemctl restart network), curl fails in some cases, while kubectl continues to work normally: kubectl get pod -owide, kubectl get service, and kubectl get endpoints all return the same results as before.

From the host:

success: curl 113.122.0.237:80 (by pod IP)
success: curl 113.123.86.142:8080 (by cluster IP)
failed: curl nginx-service.default.svc.cluster.local:8080 (by service name)

From within the pod:

success: curl 113.122.0.237:80 (by pod IP)
failed: curl 113.123.86.142:8080 (by cluster IP)
failed: curl nginx-service.default.svc.cluster.local:8080 (by service name)
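A repeatable way to run those pod-side checks (a sketch; <curl-pod> is a placeholder for the pod that has curl installed):

kubectl exec <curl-pod> -- sh -c '
  curl -s -o /dev/null -w "pod ip:     %{http_code}\n" --max-time 5 113.122.0.237:80
  curl -s -o /dev/null -w "cluster ip: %{http_code}\n" --max-time 5 113.123.86.142:8080
  curl -s -o /dev/null -w "service:    %{http_code}\n" --max-time 5 nginx-service.default.svc.cluster.local:8080
'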

I compared the output of ipvsadm -L and iptables -L before and after restarting the network and found them identical, and the IP address has not changed.
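A reproducible way to do that comparison, plus one extra check that is a suggestion rather than something from this thread (conntrack requires conntrack-tools):

# capture host state before and after the restart, then diff
ipvsadm -L -n > /tmp/ipvs.before
iptables-save > /tmp/iptables.before
systemctl restart network
ipvsadm -L -n > /tmp/ipvs.after
iptables-save > /tmp/iptables.after
diff /tmp/ipvs.before /tmp/ipvs.after
diff /tmp/iptables.before /tmp/iptables.after
# stale conntrack entries can also break ipvs-forwarded traffic
conntrack -L -d 113.123.86.142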

Below is the configuration information for k3s:

[root@linkone ~]# cat /etc/systemd/system/k3s.service
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%N
EnvironmentFile=-/etc/sysconfig/%N
EnvironmentFile=-/etc/systemd/system/k3s.service.env
KillMode=process
Delegate=yes
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStartPre=-/sbin/modprobe ip_vs
ExecStartPre=-/sbin/modprobe ip_vs_rr
ExecStartPre=-/sbin/modprobe ip_vs_wrr
ExecStartPre=-/sbin/modprobe ip_vs_sh
ExecStartPre=-/sbin/modprobe nf_conntrack
ExecStart=/usr/local/bin/k3s \
    server \
        '--write-kubeconfig' \
        '/dev/null' \
        '--write-kubeconfig-mode' \
        '666' \
        '--cluster-cidr' \
        '113.122.0.0/16' \
        '--service-cidr' \
        '113.123.0.0/16' \
        '--cluster-dns' \
        '113.123.0.10' \
        '--service-node-port-range' \
        '400-32767' \
        '--node-name' \
        'uap-node' \
        '--docker' \
        '--disable=traefik' \
        '--disable=coredns' \
        '--disable=local-storage' \
        '--prefer-bundled-bin' \
        '--flannel-ipv6-masq' \
        '--kube-proxy-arg=proxy-mode=ipvs' \
        '--kubelet-arg=eviction-hard=imagefs.available<0.00000001%,nodefs.available<0.00000001%' \
        '--kubelet-arg=eviction-minimum-reclaim=imagefs.available=0.00000001%,nodefs.available=0.00000001%' \
        '--kube-proxy-arg=proxy-mode=ipvs' \
        '--resolv-conf=/etc/rancher/k3s/resolv.conf' \
        '--kubelet-arg=config=/etc/rancher/k3s/kubelet.config' \
        '--node-ip' \
        '172.17.7.110'

My k3s environment uses Docker to manage containers. The CoreDNS component was deployed by myself using Helm and is not managed by k3s. I have also ruled out issues related to the --prefer-bundled-bin and --flannel-ipv6-masq options. After restarting k3s, the issue disappears. The issue occurs with the same k3s configuration on both CentOS and BCLinux (22.10 LTS, 5.10.0-60.70.0.94.oe2203.bclinux.x86_64).
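Since restarting k3s clears the issue, one possible stopgap is a systemd drop-in that restarts k3s whenever the network service is restarted (a sketch, assuming systemctl restart network maps to the legacy network.service unit):

mkdir -p /etc/systemd/system/k3s.service.d
cat > /etc/systemd/system/k3s.service.d/network-restart.conf <<'EOF'
[Unit]
# PartOf= propagates stop/restart of network.service to k3s.service,
# so "systemctl restart network" also restarts k3s automatically
PartOf=network.service
EOF
systemctl daemon-reload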

Thank you all for your valuable suggestions and assistance. I truly appreciate your help.

xiaosaxu commented 5 months ago

Addendum: upgrading k3s to the latest v1.29.4 does not fix this problem.
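For reference, a typical way to pin such an upgrade through the install script (INSTALL_K3S_VERSION is the documented install-script variable; the exact +k3s1 build tag here is an assumption):

# install/upgrade to an explicit release, then verify
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.29.4+k3s1" sh -
k3s --version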

github-actions[bot] commented 4 months ago

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.