kubeovn / kube-ovn

A Bridge between SDN and Cloud Native (Project under CNCF)
https://kubeovn.github.io/docs/stable/en/
Apache License 2.0

[BUG] ovs process killed, coredump #4645

Open bobz965 opened 1 month ago

bobz965 commented 1 month ago

Kube-OVN Version

master

Kubernetes Version

1.31

Operation-system/Kernel Version

6.8 github ci

Description

image

ovs killed

Steps To Reproduce

github ci

Current Behavior

ovs process killed

Expected Behavior

ovs process running

bobz965 commented 1 month ago

Tested in an arm environment:

image

image

The e2e test failed, but the ovs pod is still running (it did not crash).

bobz965 commented 1 month ago

image

bobz965 commented 1 month ago

First, fix the issue where lsp-type OVN EIPs cannot become ready in the arm environment: https://github.com/kubeovn/kube-ovn/pull/4647

bobz965 commented 1 month ago

image

@zcq98 It is now basically confirmed that the ovn-controller process in the ovs pod crashes after this step.

bobz965 commented 1 month ago

Confirmed: creating only the second VLAN subnet, without creating an lrp-type EIP in that second subnet, does not crash ovn-controller.

image

image

You can see that extra is not yet connected to the VPC:

image

bobz965 commented 1 month ago

As soon as the extra VLAN subnet is bound to the VPC and the lrp is created, ovn-controller crashes. @zcq98

image

image

bobz965 commented 1 month ago

(ae86) ➜  ovs git:(main) k ko nbctl show
switch 845c29dc-823b-497e-a2b9-32c4a4a8cb25 (extra)
    port localnet.extra
        type: localnet
        addresses: ["unknown"]
    port extra-no-bfd-vpc-124613236
        type: router
        router-port: no-bfd-vpc-124613236-extra
switch 5f9821a5-b892-48b7-889e-e414c64e0efb (ovn-default)
    port kube-ovn-pinger-vksmb.kube-system
        addresses: ["5e:75:0a:8e:a7:b4 10.16.0.9"]
    port coredns-6f6b679f8f-ghz49.kube-system
        addresses: ["36:ca:58:3f:be:a8 10.16.0.7"]
    port ovn-default-ovn-cluster
        type: router
        router-port: ovn-cluster-ovn-default
    port kube-ovn-pinger-nl6jj.kube-system
        addresses: ["3e:df:d4:7f:17:77 10.16.0.8"]
    port coredns-6f6b679f8f-gzxmm.kube-system
        addresses: ["fe:fc:af:4f:5b:a7 10.16.0.6"]
switch abab4d01-5632-40b8-9dd8-fde6935ac865 (join)
    port node-kube-ovn-worker
        addresses: ["2e:d9:82:be:74:fd 100.64.0.2"]
    port join-ovn-cluster
        type: router
        router-port: ovn-cluster-join
    port node-kube-ovn-control-plane
        addresses: ["42:6e:be:d2:5f:2a 100.64.0.3"]
switch fc2d7d79-5e84-4354-9d4b-37aa30e7c80c (no-bfd-subnet-186440052)
    port no-bfd-kube-ovn-worker.ovn-vpc-nat-gw-3437
        addresses: ["ee:02:04:ca:b3:51 192.168.0.3"]
    port no-bfd-kube-ovn-control-plane.ovn-vpc-nat-gw-3437
        addresses: ["fe:34:eb:0d:e5:d3 192.168.0.2"]
    port fip-pod-141121111.ovn-vpc-nat-gw-3437
        addresses: ["d2:65:92:2c:e7:b1 192.168.0.4"]
    port no-bfd-subnet-186440052-no-bfd-vpc-124613236
        type: router
        router-port: no-bfd-vpc-124613236-no-bfd-subnet-186440052
switch 5a722dff-9753-44ee-bc87-654cda9ed951 (external)
    port localnet.external
        type: localnet
        addresses: ["unknown"]
    port external-ovn-cluster
        type: router
        router-port: ovn-cluster-external
    port external-no-bfd-vpc-124613236
        type: router
        router-port: no-bfd-vpc-124613236-external
router 04174800-2e31-446a-a754-cd8702e4ac70 (no-bfd-vpc-124613236)
    port no-bfd-vpc-124613236-no-bfd-subnet-186440052
        mac: "1a:bc:de:93:5e:eb"
        networks: ["192.168.0.1/24"]
    port no-bfd-vpc-124613236-extra
        mac: "66:97:98:78:6f:0b"
        networks: ["172.20.0.4/16"]
        gateway chassis: [8ba50fca-0bfd-4bcc-bf4d-804f2defa7a6 91d42c2a-5131-4eb1-9139-a78a7a21c34f]
    port no-bfd-vpc-124613236-external
        mac: "12:9e:c2:4d:8e:d1"
        networks: ["172.19.0.5/16"]
        gateway chassis: [8ba50fca-0bfd-4bcc-bf4d-804f2defa7a6 91d42c2a-5131-4eb1-9139-a78a7a21c34f]
    nat 1afa9eb9-24ec-4382-a7e4-fb95819a4104
        external ip: "172.19.0.5"
        logical ip: "192.168.0.0/24"
        type: "snat"
    nat 2f32799d-b30e-4b6d-8acb-8984586e905b
        external ip: "172.19.0.5"
        logical ip: "192.168.0.5"
        type: "dnat_and_snat"
    nat d771a18f-75fd-4452-ba42-f4df56b4b7c1
        external ip: "172.19.0.7"
        logical ip: "192.168.0.4"
        type: "dnat_and_snat"
router 9e47e901-d7dc-47a5-a678-919bea116d0f (ovn-cluster)
    port ovn-cluster-external
        mac: "42:cf:21:b1:4e:c9"
        networks: ["172.19.0.6/16"]
        gateway chassis: [91d42c2a-5131-4eb1-9139-a78a7a21c34f 8ba50fca-0bfd-4bcc-bf4d-804f2defa7a6]
    port ovn-cluster-ovn-default
        mac: "66:32:ec:72:c1:36"
        networks: ["10.16.0.1/16"]
    port ovn-cluster-join
        mac: "56:b3:e3:3e:2c:ea"
        networks: ["100.64.0.1/16"]
(ae86) ➜  ovs git:(main)
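
To make the dump above easier to scan, here is a minimal Python sketch that lists, per logical router, the router ports that carry a "gateway chassis" line. It only assumes the text layout shown in the `nbctl show` output above. The helper name `routers_with_gateway_ports` and the shortened `SAMPLE` excerpt are hypothetical and not part of kube-ovn or OVN. It is a reading aid, not a diagnosis of the crash.

```python
import re

# Shortened excerpt of the `nbctl show` output above (illustration only).
SAMPLE = """
switch 845c29dc-823b-497e-a2b9-32c4a4a8cb25 (extra)
    port extra-no-bfd-vpc-124613236
        type: router
        router-port: no-bfd-vpc-124613236-extra
router 04174800-2e31-446a-a754-cd8702e4ac70 (no-bfd-vpc-124613236)
    port no-bfd-vpc-124613236-no-bfd-subnet-186440052
        mac: "1a:bc:de:93:5e:eb"
        networks: ["192.168.0.1/24"]
    port no-bfd-vpc-124613236-extra
        mac: "66:97:98:78:6f:0b"
        networks: ["172.20.0.4/16"]
        gateway chassis: [8ba50fca-0bfd-4bcc-bf4d-804f2defa7a6 91d42c2a-5131-4eb1-9139-a78a7a21c34f]
    port no-bfd-vpc-124613236-external
        mac: "12:9e:c2:4d:8e:d1"
        networks: ["172.19.0.5/16"]
        gateway chassis: [8ba50fca-0bfd-4bcc-bf4d-804f2defa7a6 91d42c2a-5131-4eb1-9139-a78a7a21c34f]
    nat 1afa9eb9-24ec-4382-a7e4-fb95819a4104
        external ip: "172.19.0.5"
        logical ip: "192.168.0.0/24"
        type: "snat"
router 9e47e901-d7dc-47a5-a678-919bea116d0f (ovn-cluster)
    port ovn-cluster-external
        mac: "42:cf:21:b1:4e:c9"
        networks: ["172.19.0.6/16"]
        gateway chassis: [91d42c2a-5131-4eb1-9139-a78a7a21c34f 8ba50fca-0bfd-4bcc-bf4d-804f2defa7a6]
    port ovn-cluster-ovn-default
        mac: "66:32:ec:72:c1:36"
        networks: ["10.16.0.1/16"]
"""

# Top-level blocks start at column 0: "<switch|router> <uuid> (<name>)".
TOP_LEVEL = re.compile(r"(switch|router) \S+ \((.+)\)$")


def routers_with_gateway_ports(show_output: str) -> dict[str, list[str]]:
    """Map each router name to its ports that list gateway chassis."""
    result: dict[str, list[str]] = {}
    router = None  # current router name; None while inside a switch block
    port = None    # current port name within the router block
    for line in show_output.splitlines():
        stripped = line.strip()
        top = TOP_LEVEL.match(line.rstrip())
        if top:
            router = top.group(2) if top.group(1) == "router" else None
            port = None
        elif router and stripped.startswith("port "):
            port = stripped[len("port "):]
        elif router and stripped.startswith("nat "):
            port = None  # nat entries are not ports
        elif router and port and stripped.startswith("gateway chassis:"):
            result.setdefault(router, []).append(port)
    return result


if __name__ == "__main__":
    for name, ports in routers_with_gateway_ports(SAMPLE).items():
        print(f"{name}: {len(ports)} gateway port(s): {', '.join(ports)}")
```

Run against the full dump, this would show that no-bfd-vpc-124613236 has two gateway ports (the extra and external ports), while ovn-cluster has one.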