kubeovn / kube-ovn

A Bridge between SDN and Cloud Native (Project under CNCF)
https://kubeovn.github.io/docs/stable/en/
Apache License 2.0

IP resources are not reclaimed; stale IP allocations remain in the subnet #2125

Closed syang1997 closed 1 year ago

syang1997 commented 1 year ago

Expected Behavior

IP resources are not reclaimed; stale IP allocations remain in the subnet.

Actual Behavior

Steps to Reproduce the Problem

apiVersion: kubeovn.io/v1
kind: Subnet
metadata:
  name: subnet-cdq57ea8j5gqg4vf8ak0
spec:
  cidrBlock: 168.50.8.0/24
  default: false
  excludeIps:
  - 168.50.8.254
  gateway: 168.50.8.254
  gatewayNode: ""
  gatewayType: distributed
  natOutgoing: false
  private: false
  protocol: IPv4
  provider: ovn
  vpc: vpc-cdq56t28j5gqg4vf8ajg
NAME                          PROVIDER   VPC                        PROTOCOL   CIDR            PRIVATE   NAT     DEFAULT   GATEWAYTYPE   V4USED   V4AVAILABLE   V6USED   V6AVAILABLE   EXCLUDEIPS
subnet-cdq57ea8j5gqg4vf8ak0   ovn        vpc-cdq56t28j5gqg4vf8ajg   IPv4       168.50.8.0/24   false     false   false     distributed   11       242           0        0             ["168.50.8.254"]
[root@iaas-cms-ctrl-1 ~]# k get ip | grep 168.50.8.
vm-ce3fr4q8j5gh613m5u50.yiaas.net1.yiaas.ovn                                                       168.50.8.2               00:00:00:A9:2E:06   iaas-cms-ctrl-1   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce3vprq8j5ggeis9ivig.yiaas.net1.yiaas.ovn                                                       168.50.8.1               00:00:00:BF:18:12   iaas-cms-ctrl-1   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce418ki8j5ggeis9ivmg.yiaas.net1.yiaas.ovn                                                       168.50.8.2               00:00:00:21:DA:C3   iaas-cms-ctrl-2   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce41etq8j5ggeis9ivo0.yiaas.net1.yiaas.ovn                                                       168.50.8.3               00:00:00:52:22:AC   iaas-cms-ctrl-1   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce41hri8j5ggeis9ivqg.yiaas.net1.yiaas.ovn                                                       168.50.8.4               00:00:00:5A:55:C0   iaas-cms-ctrl-2   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce41kta8j5ggeis9ivs0.yiaas.net1.yiaas.ovn                                                       168.50.8.5               00:00:00:9A:39:09   iaas-cms-ctrl-1   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce441k28j5ggeis9ivug.yiaas.net1.yiaas.ovn                                                       168.50.8.6               00:00:00:E9:33:BB   iaas-cms-ctrl-2   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce4l38i8j5ggeis9j050.yiaas.net1.yiaas.ovn                                                       168.50.8.7               00:00:00:FC:DB:BC   iaas-cms-ctrl-1   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce4qfgq8j5ggeis9j070.yiaas.net1.yiaas.ovn                                                       168.50.8.8               00:00:00:EF:C2:10   iaas-cms-ctrl-2   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce4s0hq8j5ggeis9j0hg.yiaas.net1.yiaas.ovn                                                       168.50.8.9               00:00:00:81:B5:1B   iaas-cms-ctrl-2   subnet-cdq57ea8j5gqg4vf8ak0
vm-ce4s9qq8j5ggeis9j1lg.yiaas.net1.yiaas.ovn                                                       168.50.8.10              00:00:00:B5:37:26   iaas-cms-ctrl-2   subnet-cdq57ea8j5gqg4vf8ak0

Additional Info

bobz965 commented 1 year ago

Some IP CRD records were not cleaned up after their VMs were deleted, and there are duplicate VM IP CRDs.
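A minimal sketch for spotting such duplicates, assuming the `k get ip` listing is available as text with the resource name in the first column and the IPv4 address in the second, as in the output above. The function name `find_duplicate_ips` is made up for this illustration and is not part of kube-ovn.

```python
from collections import defaultdict


def find_duplicate_ips(listing: str) -> dict[str, list[str]]:
    """Group IP CR names by address and keep addresses claimed more than once.

    Assumption: each data line is whitespace separated, with the CR name
    first and the IPv4 address second, as in the listing in this thread.
    """
    by_ip = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, ip = fields[0], fields[1]
        by_ip[ip].append(name)
    return {ip: names for ip, names in by_ip.items() if len(names) > 1}


# Three rows copied from the listing above.
sample = """\
vm-ce3fr4q8j5gh613m5u50.yiaas.net1.yiaas.ovn  168.50.8.2  00:00:00:A9:2E:06  iaas-cms-ctrl-1  subnet-cdq57ea8j5gqg4vf8ak0
vm-ce3vprq8j5ggeis9ivig.yiaas.net1.yiaas.ovn  168.50.8.1  00:00:00:BF:18:12  iaas-cms-ctrl-1  subnet-cdq57ea8j5gqg4vf8ak0
vm-ce418ki8j5ggeis9ivmg.yiaas.net1.yiaas.ovn  168.50.8.2  00:00:00:21:DA:C3  iaas-cms-ctrl-2  subnet-cdq57ea8j5gqg4vf8ak0
"""
duplicates = find_duplicate_ips(sample)
print(duplicates)
```

On the rows above this reports 168.50.8.2 as claimed by two different IP CRs.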

oilbeater commented 1 year ago

This may be related to https://github.com/kubeovn/kube-ovn/pull/2087. Please update to include that patch and check again.

hongzhen-ma commented 1 year ago

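For a VM with keep-vm-ip=true enabled: if the VM pod is deleted while the VM is running, the IP CRD is deleted together with the VM pod. If the VM is deleted while it is stopped, the IP CRD has to wait for GC to reclaim it, which takes about 12 minutes. A stopped VM may be started again, so its IP CRD is kept until the VM is deleted.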
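The two cases described above can be written down as a small decision function. This is only a model of the description in this comment, not kube-ovn code; the names `IPCleanup`, `expected_ip_cleanup` and `APPROX_GC_WAIT_MINUTES` are hypothetical.

```python
from enum import Enum

# Approximate figure quoted in the comment above; not read from kube-ovn.
APPROX_GC_WAIT_MINUTES = 12


class IPCleanup(Enum):
    WITH_POD = "deleted together with the VM pod"
    BY_GC = "kept until garbage collection"


def expected_ip_cleanup(vm_state_at_delete: str) -> IPCleanup:
    """Expected fate of the IP CRD when keep-vm-ip=true, per the description."""
    state = vm_state_at_delete.lower()
    if state == "running":
        return IPCleanup.WITH_POD
    if state == "stopped":
        return IPCleanup.BY_GC
    raise ValueError(f"state not covered by the description: {vm_state_at_delete!r}")


for state in ("running", "stopped"):
    print(state, "->", expected_ip_cleanup(state).value)
```

Under this model, an IP CRD that outlives its deleted VM by much more than `APPROX_GC_WAIT_MINUTES` is not explained by either case.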

bobz965 commented 1 year ago

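For a VM with keep-vm-ip=true enabled: if the VM pod is deleted while the VM is running, the IP CRD is deleted together with the VM pod. If the VM is deleted while it is stopped, the IP CRD has to wait for GC to reclaim it, which takes about 12 minutes. A stopped VM may be started again, so its IP CRD is kept until the VM is deleted.

These duplicate IP CRDs have definitely existed for much longer than 12 minutes.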

bobz965 commented 1 year ago
[centos@iaas-cms-ctrl-1 ~]$ grep "Starting OVN controller" -r /var/log/kube-ovn/
/var/log/kube-ovn/kube-ovn-controller.log:I1205 14:12:20.809618       7 controller.go:461] Starting OVN controller
[centos@iaas-cms-ctrl-1 ~]$
[centos@iaas-cms-ctrl-1 ~]$ ssh iaas-cms-ctrl-2
[centos@iaas-cms-ctrl-2 ~]$ grep "Starting OVN controller" -r /var/log/kube-ovn/
[centos@iaas-cms-ctrl-2 ~]$ logout
Connection to iaas-cms-ctrl-2 closed.
[centos@iaas-cms-ctrl-1 ~]$ ssh iaas-cms-ctrl-3
[centos@iaas-cms-ctrl-3 ~]$ grep "Starting OVN controller" -r /var/log/kube-ovn/

There are no recent logs of kube-ovn-controller crashing repeatedly, so it has probably not been crash-looping.

bobz965 commented 1 year ago

[root@iaas-cms-ctrl-1 ovn]# k get ip vm-ce3fr4q8j5gh613m5u50.yiaas.net1.yiaas.ovn -o yaml
apiVersion: kubeovn.io/v1
kind: IP
metadata:
  creationTimestamp: "2022-11-30T06:53:32Z"
  generation: 3
  labels:
    ovn.kubernetes.io/subnet: subnet-cdq57ea8j5gqg4vf8ak0
    subnet-cdq57ea8j5gqg4vf8ak0: ""
  name: vm-ce3fr4q8j5gh613m5u50.yiaas.net1.yiaas.ovn
  resourceVersion: "17229232"
  uid: 92614203-af24-4327-ae14-edbc9a41c771
spec:
  attachIps: []
  attachMacs: []
  attachSubnets: []
  containerID: ""
  ipAddress: 168.50.8.2
  macAddress: 00:00:00:A9:2E:06
  namespace: yiaas
  nodeName: iaas-cms-ctrl-1
  podName: vm-ce3fr4q8j5gh613m5u50
  podType: VirtualMachine
  subnet: subnet-cdq57ea8j5gqg4vf8ak0 # subnet IDs do not match
  v4IpAddress: 168.50.8.2
  v6IpAddress: ""
[root@iaas-cms-ctrl-1 ovn]# k get ip vm-ce418ki8j5ggeis9ivmg.yiaas.net1.yiaas.ovn -o yaml
apiVersion: kubeovn.io/v1
kind: IP
metadata:
  creationTimestamp: "2022-12-01T02:42:52Z"
  generation: 4
  labels:
    ovn.kubernetes.io/subnet: subnet-cdq57ea8j5gqg4vf8ak0
    subnet-cdq57ea8j5gqg4vf8ak0: ""
  name: vm-ce418ki8j5ggeis9ivmg.yiaas.net1.yiaas.ovn
  resourceVersion: "18109557"
  uid: 9ea4b4f9-6317-4a8a-ba1e-3afafa77db48
spec:
  attachIps: []
  attachMacs: []
  attachSubnets: []
  containerID: ""
  ipAddress: 168.50.8.2
  macAddress: 00:00:00:21:DA:C3
  namespace: yiaas
  nodeName: iaas-cms-ctrl-2
  podName: vm-ce418ki8j5ggeis9ivmg
  podType: VirtualMachine
  subnet: subnet-cdq57ea8j5gqg4vf8ak0  # subnet IDs do not match
  v4IpAddress: 168.50.8.2
  v6IpAddress: ""
[root@iaas-cms-ctrl-1 ovn]#

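# This one has been retained for a long time

# These two IPs are in different subnets, which is why the IPs conflict; the root cause should be a subnet conflict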
bobz965 commented 1 year ago

825fb609671fca18aad6e6f6d576f50

The webhook did not block the creation of a subnet with the same CIDR, yet it blocked the deletion.

oilbeater commented 1 year ago

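If audit logging is enabled in Kubernetes, you can check the operation history of this IP resource to see whether it was deleted and then created again.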

hongzhen-ma commented 1 year ago

825fb609671fca18aad6e6f6d576f50

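The webhook did not block the creation of a subnet with the same CIDR, yet it blocked the deletion.

If there are other webhook problems, please list them together. I will verify this subnet validation.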

bobz965 commented 1 year ago

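825fb609671fca18aad6e6f6d576f50 The webhook did not block the creation of a subnet with the same CIDR, yet it blocked the deletion.

If there are other webhook problems, please list them together. I will verify this subnet validation.

To summarize the problems we ran into:

  1. When a subnet CIDR is configured with only an IP and no mask, kube-ovn-controller crashes outright.
  2. Within the same VPC, creating or updating subnets can result in two subnets with the same CIDR.
  3. An excessive number of exclude-ips on a subnet should be restricted, to prevent the case where no IP is left to allocate.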
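The three checks in this list can be sketched with Python's standard `ipaddress` module. This is an illustration of the checks being asked for, not the kube-ovn webhook; `validate_subnet` is a hypothetical helper, `existing_cidrs` is assumed to hold the CIDRs of the other subnets in the same VPC, and exclude IPs are assumed to be single addresses (no ranges).

```python
import ipaddress


def validate_subnet(cidr: str, exclude_ips: list[str], existing_cidrs: list[str]) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    # Item 1: a bare IP such as "168.50.8.0" carries no prefix length.
    if "/" not in cidr:
        return [f"cidr {cidr!r} has no mask"]
    try:
        net = ipaddress.ip_network(cidr, strict=False)
    except ValueError as err:
        return [f"cidr {cidr!r} is invalid: {err}"]

    problems = []
    # Item 2: reject a CIDR that overlaps another subnet of the same VPC.
    for other in existing_cidrs:
        other_net = ipaddress.ip_network(other, strict=False)
        if net.version == other_net.version and net.overlaps(other_net):
            problems.append(f"cidr {cidr} overlaps existing subnet {other}")

    # Item 3: make sure the exclude list does not consume every host address.
    # Building the full host set is fine for a /24 but costly for huge ranges.
    usable = {str(host) for host in net.hosts()}
    if not usable - set(exclude_ips):
        problems.append("exclude IPs leave no allocatable address")
    return problems


print(validate_subnet("168.50.8.0", [], []))
print(validate_subnet("168.50.8.0/24", ["168.50.8.254"], ["168.50.8.0/24"]))
```

The function returns all problems it finds instead of stopping at the first one, so a rejected request can report everything in a single message.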
bobz965 commented 1 year ago

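If audit logging is enabled in Kubernetes, you can check the operation history of this IP resource to see whether it was deleted and then created again.

It looks like the two duplicate IPs were created roughly 20 hours apart and do not belong to the same subnet, so this is probably not a delete-then-recreate sequence triggered by the same pod.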
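The gap can be checked directly from the two `creationTimestamp` values in the IP CR YAML posted earlier in this thread. A minimal sketch; `parse_k8s_time` is a helper written for this example.

```python
from datetime import datetime, timezone


def parse_k8s_time(ts: str) -> datetime:
    """Parse a Kubernetes creationTimestamp such as 2022-11-30T06:53:32Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


first = parse_k8s_time("2022-11-30T06:53:32Z")   # vm-ce3fr4q8j5gh613m5u50
second = parse_k8s_time("2022-12-01T02:42:52Z")  # vm-ce418ki8j5ggeis9ivmg
gap = second - first
print(gap)  # prints 19:49:20
```

The two records were created just under 20 hours apart.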

Audit logging is planned on our side but is not available yet.

bobz965 commented 1 year ago
# On the master branch, VM IPs are not cleaned up either
[root@iaas-cms-ctrl-1 ~]# k get ip  | grep  168.0.0
vm-ce6s14q8j5gjlb83p58g.yiaas.net1.yiaas.ovn                                             168.0.0.2             00:00:00:A7:B8:D2   iaas-cms-ctrl-1   subnet-ce6rhna8j5gjlb83p4fg
vm-ce6tl1a8j5gjlb83p5e0.yiaas.net1.yiaas.ovn                                             168.0.0.3             00:00:00:A1:E1:70   iaas-cms-ctrl-2   subnet-ce6rhna8j5gjlb83p4fg
vm-ce7am7q8j5gjlb83p5lg.yiaas.net1.yiaas.ovn                                             168.0.0.4             00:00:00:AC:78:8F   iaas-cms-ctrl-2   subnet-ce6rhna8j5gjlb83p4fg
vpc-nat-gw-ngw-ce7a22a8j5gjlb83p5gg-0.kube-system                                        168.0.0.253           00:00:00:1E:D3:9B   iaas-cms-ctrl-3   subnet-ce6rhna8j5gjlb83p4fg
[root@iaas-cms-ctrl-1 ~]#

# Checked again 35 minutes later; they are still present
oilbeater commented 1 year ago

Are all of them left behind, or are only some not cleaned up during batch creation and deletion?

bobz965 commented 1 year ago

Are all of them left behind, or are only some not cleaned up during batch creation and deletion?


[root@iaas-cms-ctrl-1 ~]# k get ip  | grep  168.0.0
vm-ce6s14q8j5gjlb83p58g.yiaas.net1.yiaas.ovn                                             168.0.0.2             00:00:00:A7:B8:D2   iaas-cms-ctrl-1   subnet-ce6rhna8j5gjlb83p4fg
vm-ce6tl1a8j5gjlb83p5e0.yiaas.net1.yiaas.ovn                                             168.0.0.3             00:00:00:A1:E1:70   iaas-cms-ctrl-2   subnet-ce6rhna8j5gjlb83p4fg
vm-ce7am7q8j5gjlb83p5lg.yiaas.net1.yiaas.ovn                                             168.0.0.4             00:00:00:AC:78:8F   iaas-cms-ctrl-2   subnet-ce6rhna8j5gjlb83p4fg
vpc-nat-gw-ngw-ce7a22a8j5gjlb83p5gg-0.kube-system                                        168.0.0.253           00:00:00:1E:D3:9B   iaas-cms-ctrl-3   subnet-ce6rhna8j5gjlb83p4fg
[root@iaas-cms-ctrl-1 ~]#
[root@iaas-cms-ctrl-1 ~]# k get po -A -o wide | grep 168.0.0
kube-system            vpc-nat-gw-ngw-ce7a22a8j5gjlb83p5gg-0              1/1     Running            0                3h53m   168.0.0.253    iaas-cms-ctrl-3   <none>           <none>
[root@iaas-cms-ctrl-1 ~]#

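Judging from the current state of the cluster, the IPs of all deleted VMs appear to have been left behind. So far, the VMs in our tests have all been created one at a time.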
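The comparison done by eye above (IP CRs versus what still exists) can be sketched as a small filter. The dictionary keys mirror the `spec` fields seen in the `k get ip -o yaml` output in this thread; `orphaned_vm_ips` and the `existing_vms` input are hypothetical and only illustrate the idea.

```python
def orphaned_vm_ips(ip_crs: list[dict], existing_vms: set[tuple[str, str]]) -> list[str]:
    """Return names of VM-type IP CRs whose VM no longer exists.

    ip_crs: dicts with "name", "namespace", "podName" and "podType".
    existing_vms: (namespace, vm_name) pairs, e.g. from `k get vm -A`.
    """
    return [
        cr["name"]
        for cr in ip_crs
        if cr["podType"] == "VirtualMachine"
        and (cr["namespace"], cr["podName"]) not in existing_vms
    ]


# Names follow the listing above; the set of surviving VMs is an
# illustrative assumption, not data taken from the cluster.
ip_crs = [
    {"name": "vm-ce6s14q8j5gjlb83p58g.yiaas.net1.yiaas.ovn", "namespace": "yiaas",
     "podName": "vm-ce6s14q8j5gjlb83p58g", "podType": "VirtualMachine"},
    {"name": "vm-ce7am7q8j5gjlb83p5lg.yiaas.net1.yiaas.ovn", "namespace": "yiaas",
     "podName": "vm-ce7am7q8j5gjlb83p5lg", "podType": "VirtualMachine"},
]
existing_vms = {("yiaas", "vm-ce7am7q8j5gjlb83p5lg")}
orphans = orphaned_vm_ips(ip_crs, existing_vms)
print(orphans)
```

Matching on the (namespace, name) pair avoids false positives when VMs in different namespaces share a name.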

oilbeater commented 1 year ago

How were the VMs created? On version 1.9 we create VMs with the VirtualMachine resource, and the IP is reclaimed normally after that resource is deleted.

syang1997 commented 1 year ago

How were the VMs created? On version 1.9 we create VMs with the VirtualMachine resource, and the IP is reclaimed normally after that resource is deleted.

VM definition:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-ce7am7q8j5gjlb83p5lg
  namespace: iaas
spec:
  dataVolumeTemplates:
  - apiVersion: cdi.kubevirt.io/v1beta1
    kind: DataVolume
    metadata:
      annotations:
        cdi.kubevirt.io/cloneStrategyOverride: copy
      name: vol-ce7am7q8j5gjlb83p5k0
      namespace: yiaas
    spec:
      pvc:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 25Gi
        storageClassName: rbd.csi.ssd
        volumeMode: Block
      source:
        pvc:
          name: img-ce3faji8j5gh613m5tkg
          namespace: yiaas
  - apiVersion: cdi.kubevirt.io/v1beta1
    kind: DataVolume
    metadata:
      name: vol-ce7am7q8j5gjlb83p5kg
      namespace: yiaas
    spec:
      pvc:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 25Gi
        storageClassName: rbd.csi.ssd
        volumeMode: Block
      source:
        blank: {}
  instancetype:
    kind: VirtualMachineInstancetype
    name: small
    revisionName: vm-ce7am7q8j5gjlb83p5lg-small-a8dee6d1-f20d-4cf2-bf1a-2fb148bd05e5-1
  running: true
  template:
    metadata:
      annotations:
        kubevirt.io/hide-pod-network: "true"
        net1.virtualmachine.fields.yiaas.yealink.com/network: vpc-ce6rhj28j5gjlb83p4f0
        net1.yiaas.ovn.kubernetes.io/allow_live_migration: "true"
        net1.yiaas.ovn.kubernetes.io/logical_switch: subnet-ce6rhna8j5gjlb83p4fg
      creationTimestamp: null
    spec:
      accessCredentials:
      - sshPublicKey:
          propagationMethod:
            configDrive: {}
          source:
            secret:
              secretName: ac-cdqot2a8j5gqg4vf8bt0
      dnsConfig:
        nameservers:
        - 168.0.0.0
      dnsPolicy: ClusterFirst
      domain:
        devices:
          disks:
          - disk: {}
            name: vol-ce7am7q8j5gjlb83p5k0
          - disk: {}
            name: vol-ce7am7q8j5gjlb83p5kg
          - disk: {}
            name: cdi-ce7am7q8j5gjlb83p5l0
          interfaces:
          - bridge: {}
            name: wk
        machine:
          type: q35
        resources: {}
      networks:
      - multus:
          networkName: net1
        name: wk
      volumes:
      - dataVolume:
          name: vol-ce7am7q8j5gjlb83p5k0
        name: vol-ce7am7q8j5gjlb83p5k0
      - dataVolume:
          name: vol-ce7am7q8j5gjlb83p5kg
        name: vol-ce7am7q8j5gjlb83p5kg
      - cloudInitConfigDrive:
          userData: |
            #cloud-config
            ssh_pwauth: True
            groups:
            - admingroup: [root,sys]
            users:
            - name: root
              gecos: Foo B. Bar
              sudo: ALL=(ALL) NOPASSWD:ALL
              groups: root
              expiredate: '2032-09-01'
              lock_passwd: false
              plain_text_passwd: 123456
        name: cdi-ce7am7q8j5gjlb83p5l0
syang1997 commented 1 year ago

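For a VM with keep-vm-ip=true enabled: if the VM pod is deleted while the VM is running, the IP CRD is deleted together with the VM pod. If the VM is deleted while it is stopped, the IP CRD has to wait for GC to reclaim it, which takes about 12 minutes. A stopped VM may be started again, so its IP CRD is kept until the VM is deleted.

Currently, when a VM is shut down and then immediately started, it loses its IP address. This is probably because the IP is reclaimed as soon as the VM shuts down, and the newly allocated IP differs from the original one.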

hongzhen-ma commented 1 year ago

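1. Subnet CIDR configured with only an IP and no mask, kube-ovn-controller crashes outright: the webhook check on subnet.spec.cidr has been updated for this. However, even with the original image, kube-ovn-controller should not crash. Instead, there should be an error like the following in the kube-ovn-controller log: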

企业微信截图_4e5fa511-9c8f-4a17-9fc6-a2a3232334b0

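2. Within the same VPC, creating or updating subnets can result in two subnets with the same CIDR: as far as I can see, the webhook already validates this. Both creating and updating a subnet go through a CIDR conflict check, and in my tests I did not find a case where such an update succeeded.

Creating a conflicting subnet: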

企业微信截图_39093895-4050-444b-8074-1791a80e9841

Updating a subnet so that its CIDR conflicts:

企业微信截图_8870e789-5611-455a-8005-31f14f8836c2

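3. Excessive exclude-ips on a subnet should be restricted, to prevent the case where no IP is left to allocate: this feels like a user-side issue, so we will not add validation for it for now.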

hongzhen-ma commented 1 year ago

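For a VM with keep-vm-ip=true enabled: if the VM pod is deleted while the VM is running, the IP CRD is deleted together with the VM pod. If the VM is deleted while it is stopped, the IP CRD has to wait for GC to reclaim it, which takes about 12 minutes. A stopped VM may be started again, so its IP CRD is kept until the VM is deleted.

Currently, when a VM is shut down and then immediately started, it loses its IP address. This is probably because the IP is reclaimed as soon as the VM shuts down, and the newly allocated IP differs from the original one.

For this behavior, we need to confirm whether the keep-vm-ip parameter is enabled. Only when this parameter is not enabled is the IP CRD deleted directly on VM shutdown. Also, please check the name of the logical-switch-port that corresponds to the VM pod in your environment: is it the VM name, or does it also include the pod name? With keep-vm-ip enabled, the LSP name should contain only the VM name, not the VM pod name.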
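The LSP naming check can be expressed as a short predicate. It assumes port names of the form `<vm>.<namespace>` with an optional `.<suffix>`, which matches the port names that appear in the `nbctl show` output in this thread; `lsp_matches_vm` is a hypothetical helper, and the `virt-launcher-...` name below is an invented example of a pod-based name.

```python
def lsp_matches_vm(lsp_name: str, vm_name: str, namespace: str) -> bool:
    """True when the port is named after the VM rather than its launcher pod."""
    prefix = f"{vm_name}.{namespace}"
    # Accept the bare "<vm>.<namespace>" form and any dotted suffix after it.
    return lsp_name == prefix or lsp_name.startswith(prefix + ".")


vm, ns = "vm-ce7vpti8j5gkp4t1ino0", "yiaas"
print(lsp_matches_vm("vm-ce7vpti8j5gkp4t1ino0.yiaas.net1.yiaas.ovn", vm, ns))
print(lsp_matches_vm("vm-ce7vpti8j5gkp4t1ino0.yiaas", vm, ns))
# Hypothetical pod-based port name, for contrast.
print(lsp_matches_vm("virt-launcher-vm-ce7vpti8j5gkp4t1ino0-abcde.yiaas", vm, ns))
```

Requiring either an exact match or a following dot keeps a VM whose name is a prefix of another VM's name from matching the wrong port.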

bobz965 commented 1 year ago

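For a VM with keep-vm-ip=true enabled: if the VM pod is deleted while the VM is running, the IP CRD is deleted together with the VM pod. If the VM is deleted while it is stopped, the IP CRD has to wait for GC to reclaim it, which takes about 12 minutes. A stopped VM may be started again, so its IP CRD is kept until the VM is deleted.

Currently, when a VM is shut down and then immediately started, it loses its IP address. This is probably because the IP is reclaimed as soon as the VM shuts down, and the newly allocated IP differs from the original one.

For this behavior, we need to confirm whether the keep-vm-ip parameter is enabled. Only when this parameter is not enabled is the IP CRD deleted directly on VM shutdown. Also, please check the name of the logical-switch-port that corresponds to the VM pod in your environment: is it the VM name, or does it also include the pod name? With keep-vm-ip enabled, the LSP name should contain only the VM name, not the VM pod name.

1. keep-vm-ip is enabled by default, and it is enabled on our side as well.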

[root@iaas-cms-ctrl-1 ~]# k get deployment -n kube-system kube-ovn-controller -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
  # ...
  name: kube-ovn-controller

spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: kube-ovn-controller
  strategy:
    rollingUpdate:
      maxSurge: 0%
      maxUnavailable: 100%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: kube-ovn-controller
        component: network
        type: infra
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: kube-ovn-controller
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - /kube-ovn/start-controller.sh
        - --default-cidr=10.16.0.0/16
        - --default-gateway=10.16.0.1
        - --default-gateway-check=true
        - --default-logical-gateway=false
        - --default-exclude-ips=
        - --node-switch-cidr=100.64.0.0/16
        - --service-cluster-ip-range=10.96.0.0/12
        - --network-type=geneve
        - --default-interface-name=
        - --default-exchange-link-name=false
        - --default-vlan-id=100
        - --ls-dnat-mod-dl-dst=true
        - --pod-nic-type=veth-pair
        - --enable-lb=true
        - --enable-np=true
        - --enable-eip-snat=true
        - --enable-external-vpc=true
        - --logtostderr=false
        - --alsologtostderr=true
        - --gc-interval=360
        - --inspect-interval=20
        - --log_file=/var/log/kube-ovn/kube-ovn-controller.log
        - --log_file_max_size=0
        - --enable-lb-svc=false
        - --keep-vm-ip=true  # this option is currently enabled by default
        - --pod-default-fip-type=
        - --v=4

2. The VM LSP should map exactly to the VM name.

[root@iaas-cms-ctrl-1 ~]# k get vm -A -o wide
NAMESPACE   NAME                      AGE     STATUS               READY
yiaas       vm-ce7am7q8j5gjlb83p5lg   30h     Stopped              False
yiaas       vm-ce7geua8j5gjlb83p5s0   23h     Stopped              False
yiaas       vm-ce7umpq8j5gkp4t1in6g   7h16m   ErrorUnschedulable   False
yiaas       vm-ce7uuia8j5gkp4t1in8g   6h59m   Running              True
yiaas       vm-ce7vb628j5gkp4t1indg   6h33m   Running              True
yiaas       vm-ce7vpti8j5gkp4t1ino0   6h1m    Running              True
yiaas       vm-ce82n2q8j5gkp4t1io2g   162m    Running              True
yiaas       vm-ce84mrq8j5gkp4t1ioj0   26m     Running              True
[root@iaas-cms-ctrl-1 ~]#
[root@iaas-cms-ctrl-1 ~]#
[root@iaas-cms-ctrl-1 ~]# k ko nbctl show | grep -C 2 vm-ce7vpti8j5gkp4t1ino0
        type: router
        router-port: vpc-ce7vii28j5gkp4t1inig-subnet-ce7viu28j5gkp4t1inj0
    port vm-ce7vpti8j5gkp4t1ino0.yiaas.net1.yiaas.ovn
        addresses: ["00:00:00:6D:C8:24 153.6.28.254"]
switch 916fc4dc-34e0-4ea6-9b67-cb5a072ecfe9 (subnet-ce7v8m28j5gkp4t1ina0)
--
        type: router
        router-port: ovn-cluster-ovn-default
    port vm-ce7vpti8j5gkp4t1ino0.yiaas
        addresses: ["00:00:00:93:99:B3 10.6.16.133"]
    port vm-ce7geua8j5gjlb83p5s0.yiaas
[root@iaas-cms-ctrl-1 ~]#
hongzhen-ma commented 1 year ago

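The configuration looks fine, and so does the LSP name. However, the symptom described looks like what happens when keep-vm-ip is not enabled. Could you check the first few lines of the kube-ovn-cni pod log to confirm which commit the image was built from? I will also find an environment, switch to that image, and try it.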

bobz965 commented 1 year ago

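The configuration looks fine, and so does the LSP name. However, the symptom described looks like what happens when keep-vm-ip is not enabled. Could you check the first few lines of the kube-ovn-cni pod log to confirm which commit the image was built from? I will also find an environment, switch to that image, and try it.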


[root@iaas-cms-ctrl-1 ~]#  k get daemonset -A -o wide | grep kube-ovn-cni
kube-system      kube-ovn-cni      3         3         3       3            3           kubernetes.io/os=linux   2d1h   cni-server                                           kubeovn/kube-ovn:v1.11.0                                                                                                              app=kube-ovn-cni
[root@iaas-cms-ctrl-1 ~]#
[root@iaas-cms-ctrl-1 ~]# k get po -A -o wide | grep kube-ovn-cni
kube-system            kube-ovn-cni-7gsxf                                 1/1     Running            0                 2d      10.121.33.12    iaas-cms-ctrl-2   <none>           <none>
kube-system            kube-ovn-cni-nsk5n                                 1/1     Running            0                 2d      10.121.33.13    iaas-cms-ctrl-3   <none>           <none>
kube-system            kube-ovn-cni-xq4hn                                 1/1     Running            0                 2d      10.121.33.11    iaas-cms-ctrl-1   <none>           <none>
[root@iaas-cms-ctrl-1 ~]# k logs -f -n kube-system            kube-ovn-cni-7gsxf
setting sysctl variable "net.ipv4.neigh.default.gc_thresh1" to "1024"
net.ipv4.neigh.default.gc_thresh1 = 1024
setting sysctl variable "net.ipv4.neigh.default.gc_thresh2" to "2048"
net.ipv4.neigh.default.gc_thresh2 = 2048
setting sysctl variable "net.ipv4.neigh.default.gc_thresh3" to "4096"
net.ipv4.neigh.default.gc_thresh3 = 4096
setting sysctl variable "net.netfilter.nf_conntrack_tcp_be_liberal" to "1"
net.netfilter.nf_conntrack_tcp_be_liberal = 1
I1205 16:50:58.161898 3737885 cniserver.go:34]
-------------------------------------------------------------------------------
Kube-OVN:
  Version:       v1.11.0
  Build:         2022-12-03_06:43:38
  Commit:        git-86f75c8
  Go Version:    go1.19.3
  Arch:          amd64
hongzhen-ma commented 1 year ago

I found a 1.10.7 environment and verified the behavior there.

企业微信截图_1bebdf42-7c44-4c4e-907b-73f67874c344

Deleting the VM:

企业微信截图_fe367d01-b2b3-49f3-ac66-5d30a6003568

GC records seen in the kube-ovn-controller log:

企业微信截图_fecd468c-5ea2-45f6-b160-ddb4e434a710

kube-ovn image:

企业微信截图_106d0be1-e06b-4ce4-b76d-d5b5695a6b8e

I was not able to reproduce the problem described in this issue.

hongzhen-ma commented 1 year ago

Deleting a pod in the running state reproduced the problem. This still needs further confirmation.

hongzhen-ma commented 1 year ago
image

Deleting the IP CRD:

image