kubeovn / kube-ovn

A Bridge between SDN and Cloud Native (Project under CNCF)
https://kubeovn.github.io/docs/stable/en/
Apache License 2.0

kube-ovn 1.12 master kube-ovn-cni readiness probe failed #2409

Closed bobz965 closed 1 year ago

bobz965 commented 1 year ago

Expected Behavior

The kube-ovn-cni readiness probe should succeed.

Actual Behavior

The kube-ovn-cni readiness probe fails and the kube-ovn-cni pods end up in CrashLoopBackOff.

Steps to Reproduce the Problem

  1. Upgrade kube-ovn from 1.11 to master (1.12)

Additional Info

"CentOS Stream 8" 5.4.210-1.el8.elrepo.x86_64


``` bash

[root@k8s-ctrl-1 ovn]# k get po -A -o wide | grep ovn
kube-system            kube-ovn-cni-99spv                              0/1     CrashLoopBackOff    6 (97s ago)        8m51s   10.5.32.22     k8s-ctrl-2   <none>           <none>
kube-system            kube-ovn-cni-cmsrv                              0/1     CrashLoopBackOff    6 (110s ago)       8m50s   10.5.32.23     k8s-ctrl-3   <none>           <none>
kube-system            kube-ovn-cni-z54mn                              0/1     CrashLoopBackOff    6 (105s ago)       8m53s   10.5.32.21     k8s-ctrl-1   <none>           <none>
kube-system            kube-ovn-controller-78fbdb4cfc-tcx5z            1/1     Running             0                  27m     10.5.32.21     k8s-ctrl-1   <none>           <none>
kube-system            kube-ovn-monitor-689675c888-zwnmw               1/1     Running             0                  171m    10.5.32.21     k8s-ctrl-1   <none>           <none>
kube-system            kube-ovn-pinger-4ddk2                           0/1     Terminating         0                  70d     10.6.2.88      k8s-ctrl-1   <none>           <none>
kube-system            kube-ovn-pinger-4ggrg                           1/1     Running             1 (42d ago)        70d     10.6.2.90      k8s-ctrl-2   <none>           <none>
kube-system            kube-ovn-pinger-l9j49                           0/1     ContainerCreating   0                  3h14m   <none>         k8s-ctrl-3   <none>           <none>
kube-system            kube-ovn-webhook-64df95846b-bqtbq               1/1     Running             0                  6d      10.5.32.21     k8s-ctrl-1   <none>           <none>
kube-system            ovn-central-797dc7cd87-h64sg                    1/1     Running             0                  28m     10.5.32.22     k8s-ctrl-2   <none>           <none>
kube-system            ovn-central-797dc7cd87-t5d4q                    1/1     Running             0                  28m     10.5.32.21     k8s-ctrl-1   <none>           <none>
kube-system            ovn-central-797dc7cd87-tnfkq                    1/1     Running             0                  28m     10.5.32.23     k8s-ctrl-3   <none>           <none>
kube-system            ovs-ovn-cwsbt                                   1/1     Running             2 (42d ago)        75d     10.5.32.22     k8s-ctrl-2   <none>           <none>
kube-system            ovs-ovn-p5b4k                                   1/1     Running             2 (42d ago)        75d     10.5.32.23     k8s-ctrl-3   <none>           <none>
kube-system            ovs-ovn-zn9rz                                   1/1     Running             0                  75d     10.5.32.21     k8s-ctrl-1   <none>           <none>

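# As shown above, the pods other than kube-ovn-cni are basically healthy:
# kube-ovn-controller: the log shows a normal start
# ovn-central: started normally
# ovs-ovn: started normally

# Below are the describe details for kube-ovn-cni, followed by its startup log.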

# k describe po -n kube-system            kube-ovn-cni-99spv
Name:                 kube-ovn-cni-99spv
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 k8s-ctrl-2/10.5.32.22
Start Time:           Wed, 01 Mar 2023 13:59:44 +0800
Labels:               app=kube-ovn-cni
                      component=network
                      controller-revision-hash=7988b4f9c8
                      pod-template-generation=5
                      type=infra
Annotations:          <none>
Status:               Running
IP:                   10.5.32.22
IPs:
  IP:           10.5.32.22
Controlled By:  DaemonSet/kube-ovn-cni
Init Containers:
  install-cni:
    Container ID:  containerd://1fc4f2cab8b684cb6bb285716dbe3337886564345df003f041d7cce304f2719a
    Image:         kubeovn/kube-ovn:v1.12.0
    Image ID:      docker.io/kubeovn/kube-ovn@sha256:962f09054cf824e4eab6ec8dccfe61444ae9b021e630c80ad2ec36c5fcf9839d
    Port:          <none>
    Host Port:     <none>
    Command:
      /kube-ovn/install-cni.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 01 Mar 2023 13:59:44 +0800
      Finished:     Wed, 01 Mar 2023 13:59:44 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /opt/cni/bin from cni-bin (rw)
      /usr/local/bin from local-bin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-drb6d (ro)
Containers:
  cni-server:
    Container ID:  containerd://14ddf93736da972875ecdfbf830009470aab8fefc5935ceb2ef05f31efa1c586
    Image:         kubeovn/kube-ovn:v1.12.0
    Image ID:      docker.io/kubeovn/kube-ovn@sha256:962f09054cf824e4eab6ec8dccfe61444ae9b021e630c80ad2ec36c5fcf9839d
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      /kube-ovn/start-cniserver.sh
    Args:
      --enable-mirror=false
      --encap-checksum=true
      --service-cluster-ip-range=10.7.0.0/16
      --iface=tunnel
      --dpdk-tunnel-iface=br-phy
      --network-type=geneve
      --default-interface-name=
      --cni-conf-name=01-kube-ovn.conflist
      --logtostderr=false
      --alsologtostderr=true
      --log_file=/var/log/kube-ovn/kube-ovn-cni.log
      --log_file_max_size=0
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    143
      Started:      Wed, 01 Mar 2023 14:09:44 +0800
      Finished:     Wed, 01 Mar 2023 14:10:28 +0800
    Ready:          False
    Restart Count:  7
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      100m
      memory:   100Mi
    Liveness:   tcp-socket :10665 delay=30s timeout=3s period=7s #success=1 #failure=3
    Readiness:  tcp-socket :10665 delay=0s timeout=3s period=7s #success=1 #failure=3
    Environment:
      ENABLE_SSL:            false
      POD_IP:                 (v1:status.podIP)
      KUBE_NODE_NAME:         (v1:spec.nodeName)
      MODULES:               kube_ovn_fastpath.ko
      RPMS:                  openvswitch-kmod
      POD_IPS:                (v1:status.podIPs)
      ENABLE_BIND_LOCAL_IP:  true
    Mounts:
      /etc/cni/net.d from cni-conf (rw)
      /etc/localtime from localtime (rw)
      /etc/openvswitch from systemid (rw)
      /lib/modules from host-modules (ro)
      /run/openvswitch from host-run-ovs (rw)
      /run/ovn from host-run-ovn (rw)
      /tmp from tmp (rw)
      /var/lib/kubelet/pods from shared-dir (rw)
      /var/log/kube-ovn from kube-ovn-log (rw)
      /var/log/openvswitch from host-log-ovs (rw)
      /var/log/ovn from host-log-ovn (rw)
      /var/run/netns from host-ns (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-drb6d (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  host-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:
  shared-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/pods
    HostPathType:
  systemid:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/origin/openvswitch
    HostPathType:
  host-run-ovs:
    Type:          HostPath (bare host directory volume)
    Path:          /run/openvswitch
    HostPathType:
  host-run-ovn:
    Type:          HostPath (bare host directory volume)
    Path:          /run/ovn
    HostPathType:
  cni-conf:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:
  cni-bin:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:
  host-ns:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/netns
    HostPathType:
  host-log-ovs:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/openvswitch
    HostPathType:
  kube-ovn-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/kube-ovn
    HostPathType:
  host-log-ovn:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/ovn
    HostPathType:
  localtime:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/localtime
    HostPathType:
  tmp:
    Type:          HostPath (bare host directory volume)
    Path:          /tmp
    HostPathType:
  local-bin:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/bin
    HostPathType:
  kube-api-access-drb6d:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 :NoSchedule op=Exists
                             :NoExecute op=Exists
                             CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  11m                    default-scheduler  Successfully assigned kube-system/kube-ovn-cni-99spv to k8s-ctrl-2
  Normal   Pulled     11m                    kubelet            Container image "kubeovn/kube-ovn:v1.12.0" already present on machine
  Normal   Created    11m                    kubelet            Created container install-cni
  Normal   Started    11m                    kubelet            Started container install-cni
  Normal   Pulled     10m (x2 over 11m)      kubelet            Container image "kubeovn/kube-ovn:v1.12.0" already present on machine
  Normal   Created    10m (x2 over 11m)      kubelet            Created container cni-server
  Warning  Unhealthy  10m (x3 over 10m)      kubelet            Liveness probe failed: dial tcp 10.5.32.22:10665: connect: connection refused
  Normal   Killing    10m                    kubelet            Container cni-server failed liveness probe, will be restarted
  Warning  Unhealthy  10m (x12 over 11m)     kubelet            Readiness probe failed: dial tcp 10.5.32.22:10665: connect: connection refused
  Warning  BackOff    6m21s (x3 over 6m32s)  kubelet            Back-off restarting failed container
  Normal   Started    86s (x8 over 11m)      kubelet            Started container cni-server

# k logs -f -n kube-system            kube-ovn-cni-99spv
setting sysctl variable "net.ipv4.neigh.default.gc_thresh1" to "1024"
net.ipv4.neigh.default.gc_thresh1 = 1024
setting sysctl variable "net.ipv4.neigh.default.gc_thresh2" to "2048"
net.ipv4.neigh.default.gc_thresh2 = 2048
setting sysctl variable "net.ipv4.neigh.default.gc_thresh3" to "4096"
net.ipv4.neigh.default.gc_thresh3 = 4096
setting sysctl variable "net.netfilter.nf_conntrack_tcp_be_liberal" to "1"
net.netfilter.nf_conntrack_tcp_be_liberal = 1
I0301 14:09:45.728595 3363292 cniserver.go:35]
-------------------------------------------------------------------------------
Kube-OVN:
  Version:       v1.12.0
  Build:         2023-02-28_14:17:48
  Commit:        git-db435dc
  Go Version:    go1.20.1
  Arch:          amd64
-------------------------------------------------------------------------------
I0301 14:09:45.817002 3363292 config.go:148] node name not specified in command line parameters, fall back to the environment variable
I0301 14:09:45.817039 3363292 config.go:315] no --kubeconfig, use in-cluster kubernetes config
I0301 14:09:45.850918 3363292 config.go:229] use 10.5.205.22 on tunnel as tunnel address
I0301 14:09:45.859093 3363292 config.go:166] daemon config: &{tunnel tunnel br-phy 1500 1460 false mirror0 /run/openvswitch/kube-ovn-daemon.sock /run/openvswitch/db.sock  0xc000782b60 0xc0005f0420 k8s-ctrl-2 10.7.0.0/16 join  true false false 10665 geneve /etc/cni/net.d /kube-ovn/01-kube-ovn.conflist 01-kube-ovn.conflist provider  kube-system external true}
I0301 14:09:45.891603 3363292 cniserver.go:179] finish adding chassis annotation
I0301 14:09:45.951277 3363292 ovs_linux.go:400] wait ovn0 gw ready
[root@k8s-ctrl-1 ovn]#

# As shown above, the kube-ovn-cni log itself contains no errors.
```

oilbeater commented 1 year ago

The ovs-ovn pods do not restart automatically; you need to restart ovs-ovn manually after the upgrade.

v1.12.0 updates both OVN and OVS, which leads to a mismatch if OVS is not restarted.
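
For readers following the advice above, a minimal sketch of a manual restart might look like this. It assumes that deleting an ovs-ovn pod is enough for its DaemonSet to recreate it, and it finds pods by the `ovs-ovn-` name prefix seen in the listing because the pod labels are not shown in this issue. The function name `restart_ovs_ovn` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: delete ovs-ovn pods one at a time so they are recreated.
# Assumption: pods are matched by the "ovs-ovn-" name prefix from the listing.
restart_ovs_ovn() {
  local ns="${1:-kube-system}"
  local pod
  # `kubectl get pods -o name` prints one "pod/<name>" entry per line.
  for pod in $(kubectl -n "$ns" get pods -o name | grep '^pod/ovs-ovn-'); do
    echo "restarting ${pod}"
    # --wait=true blocks until the old pod object is gone.
    kubectl -n "$ns" delete "$pod" --wait=true
  done
}
```

Calling `restart_ovs_ovn kube-system` would walk the pods sequentially; checking `kubectl get po -n kube-system -o wide` between deletions is a reasonable way to confirm each node's pod is back before moving on.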

bobz965 commented 1 year ago

> The ovs-ovn pods do not restart automatically; you need to restart ovs-ovn manually after the upgrade.
>
> v1.12.0 updates both OVN and OVS, which leads to a mismatch if OVS is not restarted.

Yes, I restarted all the OVN and OVS pods, but the issue persists.

I dug into it. After the upgrade, none of the pods in the VPC subnets can reach their gateway, and the kube-ovn-cni pod is still stuck in the join gateway check, which retries up to 200 times.
That check takes too long: the probes time out while kube-ovn-cni has still not started listening on port 10665 (this code line).
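
The timing can be checked against the describe output with some rough arithmetic. The sketch below assumes the first probe fires at the initial delay and later probes follow at the configured period; real kubelet timing includes jitter, so the result is approximate. The helper names `probe_deadline` and `signal_from_exit_code` are hypothetical.

```bash
#!/usr/bin/env bash
# Seconds after container start at which the Nth consecutive failure lands,
# assuming probes fire at delay, delay+period, delay+2*period, ...
probe_deadline() {
  local delay="$1" period="$2" failures="$3"
  echo $(( delay + period * (failures - 1) ))
}

# Exit codes above 128 conventionally mean "terminated by signal (code - 128)".
signal_from_exit_code() {
  echo $(( $1 - 128 ))
}

probe_deadline 30 7 3       # liveness: delay=30s period=7s #failure=3, prints 44
signal_from_exit_code 143   # prints 15, i.e. SIGTERM
```

The 44 seconds line up with the container's Started (14:09:44) and Finished (14:10:28) timestamps, and exit code 143 is consistent with the kubelet terminating the container after the liveness probe failed, as the Killing event reports. So any gateway check that runs longer than roughly 44 seconds before the server binds port 10665 would be cut off by a restart.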

I scaled ovn-central down to replicas=0, wiped all of the ovn-central databases, and scaled ovn-central back up to replicas=3. After that the gateway was reachable and kube-ovn-cni worked well.
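
The recovery steps above can be sketched as a dry-run plan. It only prints commands rather than running them, since wiping the OVN databases discards the logical network state. It assumes ovn-central is a Deployment (the pod names in the listing look like ReplicaSet-managed pods) and deliberately leaves the database file location as a comment, because that path is not given in this issue. The function name `ovn_central_reset_plan` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: print the scale-down / wipe / scale-up sequence for review.
# Assumption: ovn-central is a Deployment in the given namespace.
ovn_central_reset_plan() {
  local ns="${1:-kube-system}" replicas="${2:-3}"
  echo "kubectl -n ${ns} scale deployment ovn-central --replicas=0"
  # The database location depends on the install and is not shown here.
  echo "# on each ovn-central node: remove the OVN NB/SB database files"
  echo "kubectl -n ${ns} scale deployment ovn-central --replicas=${replicas}"
}
```

Running `ovn_central_reset_plan kube-system 3` prints the three steps in order so they can be reviewed before anything destructive is executed by hand.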

I will add some logging around this check to make it faster to debug.

bobz965 commented 1 year ago

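I will try to reproduce the upgrade steps again. Something may have gone wrong with the ovn-central DB during the upgrade, but rebuilding the DB recovers it.

My environment started out on kube-ovn 1.12. I rolled it back to 1.10.9 using install.sh; public network access through the nat-gw pod, node-to-pod traffic, and pod-to-pod traffic inside a VPC (same node or across nodes) all worked. I then upgraded to 1.12 again, and everything still worked. I could not reproduce the unreachable-gateway bug.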