kubeovn / kube-ovn

A Bridge between SDN and Cloud Native (Project under CNCF)
https://kubeovn.github.io/docs/stable/en/
Apache License 2.0

ovn-ic gw-nodes: when a gateway node fails, the gateway does not automatically switch to another configured node, causing a communication failure #3847

Closed geniusxiong closed 7 months ago

geniusxiong commented 7 months ago

Bug Report

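With ovn-ic gw-nodes set to two or more nodes that act as gateways for the cluster interconnection, a failure of the active gateway node does not trigger an automatic switch to another configured node, and cross-cluster communication fails.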

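Expected Behavior

When one of the gateway nodes configured in ovn-ic gw-nodes fails, the gateway role automatically switches to another configured node.

Actual Behavior

The gateway role does not switch to another configured node, and cross-cluster communication stays broken until the failed node comes back.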

Steps to Reproduce the Problem

  1. Configure the ovn-ic-config ConfigMap in one cluster (az176 here):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ovn-ic-config
      namespace: kube-system
    data:
      enable-ic: "true"
      az-name: "az176"
      ic-db-host: "172.18.164.171"
      ic-nb-port: "6645"
      ic-sb-port: "6646"
      gw-nodes: "node-0,node-1"
      auto-route: "true"
  2. Inside the ovn-ic container, the interconnect logical switch ts has been established and the clusters can reach each other:

    root@master-1:/kube-ovn# ovn-ic-sbctl show
    availability-zone az140
    gateway b5f7b2d5-e002-486b-a933-9d30f92b09d5
        hostname: node-0
        type: geneve
            ip: 172.18.164.143
    gateway e33e7c0d-b3c3-4afd-9598-569f024aeb9e
        hostname: master-0
        type: geneve
            ip: 172.18.164.140
        port ts-az140
            transit switch: ts
            address: ["00:00:00:E3:E2:BC 169.254.100.93/24"]
    availability-zone az170
    gateway 224a129c-63ce-46fb-b94f-e87ee7bd0f52
        hostname: node-0
        type: geneve
            ip: 172.18.164.173
        port ts-az170
            transit switch: ts
            address: ["00:00:00:34:05:F3 169.254.100.80/24"]
    availability-zone az176
    gateway 787b7f2e-a18a-4c77-b4a1-0fcf304fbbe7
        hostname: node-1
        type: geneve
            ip: 172.18.164.177
        port ts-az176
            transit switch: ts
            address: ["00:00:00:50:E0:EB 169.254.100.90/24"]
    gateway 7e6c6ee8-5f12-4eea-b7d6-355b61297fff
        hostname: node-0
        type: geneve
            ip: 172.18.164.179

    At this point the interconnection works: pods in az176 can reach pods in az170 and az140.

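  3. Power off node-1 (172.18.164.177) in cluster az176. The interconnect logical switch ts shown inside the ovn-ic container does not change, and the gateway does not automatically switch to the configured node-0 to keep traffic flowing. Pods in az176 can no longer reach pods in az170 and az140.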

  4. Power node-1 (172.18.164.177) in cluster az176 back on. Communication returns to normal.

Additional Info

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:58:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:51:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
1.11.3
NFS Server 4.0 (G193)
4.19.113-3.nfs.x86_64  
geniusxiong commented 7 months ago

Automatic failover only works when three or more gateway nodes are configured.
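
A minimal sketch of the workaround from the comment above, assuming the same ConfigMap as in step 1 and a third node that can act as a gateway. The name node-2 is hypothetical and does not appear in the report:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ovn-ic-config
  namespace: kube-system
data:
  enable-ic: "true"
  az-name: "az176"
  ic-db-host: "172.18.164.171"
  ic-nb-port: "6645"
  ic-sb-port: "6646"
  # node-2 is an assumed third gateway node, added for illustration only
  gw-nodes: "node-0,node-1,node-2"
  auto-route: "true"
```

After a change like this, ovn-ic-sbctl show (as used in step 2) can be run again to check which gateways are registered for az176 before repeating the power-off test.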