kubeovn / kube-ovn

A Bridge between SDN and Cloud Native (Project under CNCF)
https://kubeovn.github.io/docs/stable/en/
Apache License 2.0

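Deleting and recreating the ovs-ovn pod frequently fails with /var/run/openvswitch/br-int.mgmt: connection failed (No such file or directory); the pod takes a long time to recover, which causes a long interruption of VM services. #4219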

Closed: mengyu1987 closed this issue 1 week ago

mengyu1987 commented 1 week ago

Kube-OVN Version

v1.12.16

Kubernetes Version

v1.27.6

Operation-system/Kernel Version

[root@node154 openvswitch]# awk -F '=' '/PRETTY_NAME/ { print $2 }' /etc/os-release
"H3Linux 2.0.2-SP01"
[root@node154 openvswitch]# uname -r
5.10.0-136.12.0.86.4.hl202.x86_64

Description

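Deleting and recreating the ovs-ovn pod frequently produces the error /var/run/openvswitch/br-int.mgmt: connection failed (No such file or directory), and the pod takes a long time to recover. When the ovs-ovn pod on the same node is deleted repeatedly, it sometimes reports the error and sometimes starts normally.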

Steps To Reproduce

  1. After deleting the ovs-ovn pods, check how they start up; an individual pod sometimes restarts once:
    [root@node154 openvswitch]# pod |grep ovs
    kube-system                    ovs-ovn-5qkv8                                                   1/1     Running                 1 (14m ago)        15m     10.210.20.154    node154   <none>
    kube-system                    ovs-ovn-flm7s                                                   1/1     Running                 0                  15m     10.210.20.153    node153   <none>
    kube-system                    ovs-ovn-n8jbs                                                   1/1     Running                 0                  15m     10.210.20.152    node152   <none>

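  2. Check the startup log of the restarted pod ovs-ovn-5qkv8. It contains the error shown in these screenshots:

     [screenshot: 企业微信截图_1719295680815]

     [screenshot: 企业微信截图_17192867223368]

  3. Delete that pod again and it restarts normally. When the deletion is repeated several times, the error appears on some attempts and not on others.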
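When repeating step 1 many times, picking out the restarted pods by eye gets tedious. The following is a minimal sketch, not part of the original report, that scans a wide pod listing in the column layout shown in step 1 and returns the ovs-ovn pods whose restart count is above zero. The function name `restarted_pods` and the `ovs-ovn-` name prefix default are illustrative assumptions; the sample text is copied from the listing above.

```python
# Hypothetical helper: find restarted ovs-ovn pods in a wide pod listing.
# Assumed column layout (as in step 1):
#   NAMESPACE NAME READY STATUS RESTARTS [(<age> ago)] AGE IP NODE ...

SAMPLE = """\
kube-system  ovs-ovn-5qkv8  1/1  Running  1 (14m ago)  15m  10.210.20.154  node154  <none>
kube-system  ovs-ovn-flm7s  1/1  Running  0            15m  10.210.20.153  node153  <none>
kube-system  ovs-ovn-n8jbs  1/1  Running  0            15m  10.210.20.152  node152  <none>
"""


def restarted_pods(listing, name_prefix="ovs-ovn-"):
    """Return (pod, node, restarts) for matching pods that restarted at least once."""
    found = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and pods that do not match the prefix.
        if len(fields) < 8 or not fields[1].startswith(name_prefix):
            continue
        if not fields[4].isdigit():
            continue
        restarts = int(fields[4])
        rest = fields[5:]
        # A non-zero count is followed by two extra tokens, e.g. "(14m" and "ago)".
        if rest and rest[0].startswith("("):
            rest = rest[2:]
        if len(rest) < 3:
            continue
        node = rest[2]  # rest is now: AGE, IP, NODE, ...
        if restarts > 0:
            found.append((fields[1], node, restarts))
    return found


if __name__ == "__main__":
    print(restarted_pods(SAMPLE))  # [('ovs-ovn-5qkv8', 'node154', 1)]
```

Running this after each deletion gives a quick record of which node hit the restart, which can help correlate the intermittent failures with the startup log of that pod.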

Current Behavior

Deleting and recreating the ovs-ovn pod frequently produces the error, and VM services are interrupted for a long time.

Expected Behavior

After being deleted and recreated, the ovs-ovn pod recovers normally.

oilbeater commented 1 week ago

@zhangzujian It looks like the startup script does not correctly handle a Pod deletion that is not part of an upgrade.

zhangzujian commented 1 week ago

This is already fixed in v1.12.17. I suggest upgrading to v1.12.18 and checking whether the problem still occurs.
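To check a cluster's version tag against the release named in this comment, a comparison along these lines could be used. It is a sketch under stated assumptions: the helper names `parse_version` and `has_fix` are hypothetical, and the check only covers the 1.12 release line, because this thread says nothing about other branches.

```python
import re

# From the comment above: the fix landed in v1.12.17.
FIXED_IN = (1, 12, 17)


def parse_version(tag):
    """Turn a tag such as 'v1.12.16' into a tuple of ints, e.g. (1, 12, 16)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"unrecognized version tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(tag):
    """Return True if a 1.12.x tag is at or above the fixed release."""
    version = parse_version(tag)
    # Assumption: only the 1.12 line is covered; refuse to guess for other branches.
    if version[:2] != FIXED_IN[:2]:
        raise ValueError(f"{tag!r} is not on the 1.12 release line")
    return version >= FIXED_IN


if __name__ == "__main__":
    print(has_fix("v1.12.16"))  # False (the version in this report)
    print(has_fix("v1.12.18"))  # True (the suggested upgrade target)
```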

mengyu1987 commented 1 week ago

> This is already fixed in v1.12.17. I suggest upgrading to v1.12.18 and checking whether the problem still occurs.

OK, thanks a lot. I'll upgrade and try it.