kubeovn / kube-ovn

A Bridge between SDN and Cloud Native (Project under CNCF)
https://kubeovn.github.io/docs/stable/en/
Apache License 2.0
1.93k stars 438 forks

1.11: scattered IPs cannot be allocated again after IPAM releases them #3634

Closed bobz965 closed 7 months ago

bobz965 commented 8 months ago

Bug Report

In 1.11, scattered IPs cannot be allocated again after IPAM releases them.

Expected Behavior

In 1.11, scattered IPs can be allocated again after IPAM releases them.


The subnet reports that IPs are still available, but allocation finds none.

[root@csy-wx-pm-os01-eis-node01 deployer]# k get subnet vlan-net-2020 -o yaml
apiVersion: kubeovn.io/v1
kind: Subnet
metadata:
  annotations:
    cluster.ecloud.cmss.com/resource: "true"
    eis.io/provider-network: businessnet
    eis.io/vlan_id: "2020"
  creationTimestamp: "2023-11-27T10:14:23Z"
  finalizers:
  - kube-ovn-controller
  generation: 89
  labels:
    eis.io/creator: admin
    eis.io/namespace: ""
    eis.io/provider-network: businessnet
    eis.io/purpose: share
    eis.io/subnet_type: vlan
    eis.io/vlan_name: businessnet-vlan2020
  name: vlan-net-2020
  resourceVersion: "206963244"
  selfLink: /apis/kubeovn.io/v1/subnets/vlan-net-2020
  uid: fa2b2b43-2671-40c7-8a8d-3742ffa8a49a
spec:
  acls:
  - action: drop
    direction: to-lport
    match: ip6.src==2409:8c20:1833:2000::afb:af2f && ip6.dst==2409:8c20:1833:2000::afb:af40/123
      && udp.dst==5566
    priority: 1595
  cidrBlock: 10.251.175.64/27,2409:8c20:1833:2000::afb:af40/123
  default: false
  disableGatewayCheck: true
  enableDHCP: true
  excludeIps:
  - 10.251.175.94
  - 2409:8C20:1833:2000::afb:af5E
  gateway: 10.251.175.94,2409:8C20:1833:2000::afb:af5E
  gatewayNode: ""
  gatewayType: distributed
  natOutgoing: false
  private: false
  protocol: Dual
  provider: ovn
  vlan: businessnet-vlan2020
  vpc: ovn-cluster
status:
  activateGateway: ""
  conditions: # can this be moved to the end?
  - lastTransitionTime: "2023-11-27T10:14:24Z"
    lastUpdateTime: "2024-01-11T06:30:57Z"
    reason: ResetLogicalSwitchAclSuccess
    status: "True"
    type: Validated
  - lastTransitionTime: "2023-11-27T10:14:24Z"
    lastUpdateTime: "2023-11-27T10:14:24Z"
    reason: ResetLogicalSwitchAclSuccess
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-11-27T10:14:24Z"
    lastUpdateTime: "2023-11-27T10:14:24Z"
    message: Not Observed
    reason: Init
    status: Unknown
    type: Error
  dhcpV4OptionsUUID: b0b69378-0471-4053-a53a-e20c32ad3e8f
  dhcpV6OptionsUUID: 97e0cdc9-d4d8-4d19-a4b3-0aeadab98a6d
  u2oInterconnectionIP: ""
  u2oInterconnectionVPC: ""
  v4availableIPs: 4 # 4 IPs are still available
  v4usingIPs: 25
  v6availableIPs: 4
  v6usingIPs: 25
[root@csy-wx-pm-os01-eis-node01 deployer]#

# The subnet statistics are correct: exactly 4 IPs are left
Network Range 10.251.175.64 – 10.251.175.95
32 unique addresses
Usable Range 10.251.175.65 – 10.251.175.94
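
The arithmetic behind these counts can be checked with Python's standard `ipaddress` module. This is a minimal sketch: the CIDR, the gateway, and the `v4usingIPs` value are copied from the subnet above, while the variable names are illustrative only.

```python
import ipaddress

# Values copied from the subnet spec and status shown above.
network = ipaddress.ip_network("10.251.175.64/27")
gateway = ipaddress.ip_address("10.251.175.94")
using_ips = 25  # status.v4usingIPs

total = network.num_addresses   # counts the network and broadcast addresses too
hosts = list(network.hosts())   # usable range, .65 through .94
allocatable = [ip for ip in hosts if ip != gateway]  # the gateway is in excludeIps
available = len(allocatable) - using_ips

print(total, hosts[0], hosts[-1], available)
```

With 30 usable addresses, one excluded gateway, and 25 addresses in use, the result agrees with `v4availableIPs: 4`.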

[root@csy-wx-pm-os01-eis-node01 deployer]# k get ip | grep vlan-net-2020  | awk '{print $2}' | sort
# .65 unused
10.251.175.66
10.251.175.67
10.251.175.68
10.251.175.69
# .70 unused
10.251.175.71
10.251.175.72
# .73 unused
10.251.175.74
# .75 unused
10.251.175.76
10.251.175.77
10.251.175.78
10.251.175.78
10.251.175.79
10.251.175.80
10.251.175.81
10.251.175.82
10.251.175.83
10.251.175.85
10.251.175.86
10.251.175.87
10.251.175.88
10.251.175.89
10.251.175.90
10.251.175.91
10.251.175.92
10.251.175.93

.94 is the gateway.

The IPs .65, .70, .73, and .75 cannot be allocated. (According to the logs below, IPAM did release the addresses when the VM was deleted.)
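
The gaps can also be derived mechanically instead of by eye. The sketch below takes the last octets exactly as they appear in the listing above and subtracts them, plus the excluded gateway, from the usable range. The helper name `unallocated` is hypothetical and not part of kube-ovn.

```python
import ipaddress

def unallocated(cidr, excluded, used):
    """Return host addresses in cidr that are neither excluded nor in use."""
    taken = {ipaddress.ip_address(ip) for ip in list(excluded) + list(used)}
    return [ip for ip in ipaddress.ip_network(cidr).hosts() if ip not in taken]

# Last octets exactly as they appear in the pasted listing (.78 is listed twice).
listed = [66, 67, 68, 69, 71, 72, 74, 76, 77, 78, 78, 79, 80, 81, 82, 83,
          85, 86, 87, 88, 89, 90, 91, 92, 93]
used = [f"10.251.175.{octet}" for octet in listed]

free = unallocated("10.251.175.64/27", ["10.251.175.94"], used)
print([str(ip) for ip in free])
```

On the listing as pasted, this reports .65, .70, .73, and .75, and additionally .84, because .78 appears twice and there is no .84 line. The issue does not say whether that is a copy error in the listing.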

[root@csy-wx-pm-os01-eis-node01 deployer]# grep release -r /var/log/kube-ovn/kube-ovn-controller.log | grep win
I0117 11:14:51.987815       7 pod.go:839] release all ip address for deleting pod test-ycx/test-win
I0117 11:14:51.997527       7 subnet.go:474] release v4 10.222.30.4 mac 00:00:00:16:5B:AD for test-ycx/test-win, add ip to released list
I0117 11:14:51.997545       7 subnet.go:474] release v4 10.251.175.88 mac 00:00:00:EC:BF:95 for test-ycx/test-win, add ip to released list
I0117 11:14:51.997557       7 subnet.go:504] release v6 2409:8c20:1833:2000::afb:af58 mac  for test-ycx/test-win, add ip to released list
I0117 11:15:35.073710       7 pod.go:839] release all ip address for deleting pod test-ycx/test-win
I0117 11:15:39.170200       7 pod.go:839] release all ip address for deleting pod test-ycx/test-win
I0117 11:15:39.206729       7 pod.go:839] release all ip address for deleting pod test-ycx/test-win
I0117 11:15:42.552069       7 pod.go:839] release all ip address for deleting pod test-ycx/test-win
I0117 11:15:51.962748       7 pod.go:839] release all ip address for deleting pod test-ycx/test-win
[root@csy-wx-pm-os01-eis-node01 deployer]#
[root@csy-wx-pm-os01-eis-node01 deployer]#
[root@csy-wx-pm-os01-eis-node01 deployer]# k get vm -A -o wide | grep test-win
nlw         test-win2003        23h     Running   True
nlw         test-win2012        23h     Running   True
nlw         test-win2012r2      23h     Running   True
nlw         test-win2016        23h     Running   True
nlw         test-win2019        23h     Running   True
test-ycx    test-win-psw        120m    Running   True
test-ycx    test-win-psw-1      114m    Running   True
test-ycx    test-win-psw-2      64m     Running   True
test-ycx    test-win-psw-3      54m     Running   True

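`v4availableIPs: 4` is correct here, but IPAM's own bookkeeping of the remaining IP ranges has a bug, so no IP can be allocated.

After restarting kube-ovn-controller, the VM could be created, which suggests the IPAM init path is fine and the bug exists only in the dynamic reclaim path. Possibly a step that runs during IPAM init is not run after an IP is released.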

[root@csy-wx-pm-os01-eis-node01 deployer]# kgp | grep psw-1
test-ycx        virt-launcher-test-win-psw-1-g682f                      1/1     Running             0          109m    10.222.0.3      csy-wx-pm-os01-eis-node02   <none>           1/1

[root@csy-wx-pm-os01-eis-node01 deployer]# k get ip | grep psw-1
test-win-psw-1.test-ycx                                             10.222.0.3                                       00:00:00:BD:C6:3E   csy-wx-pm-os01-eis-node02   ovn-default
test-win-psw-1.test-ycx.attachnet.default.ovn                       10.251.175.65    2409:8c20:1833:2000::afb:af41   00:00:00:8D:A0:80   csy-wx-pm-os01-eis-node02   vlan-net-2020
[root@csy-wx-pm-os01-eis-node01 deployer]#

The first IP in the range was allocated.

###### Logs from allocating and deleting the VM IP

[root@csy-wx-pm-os01-eis-node01 deployer]# grep "test-ycx/test-win" -r /var/log/kube-ovn/kube-ovn-controller.log
I0117 11:06:20.863607       7 subnet.go:223] allocate pod test-ycx/test-win v4 ip in the range of start: 10.251.175.88, end: 10.251.175.88
I0117 11:06:20.863644       7 subnet.go:288] allocate pod test-ycx/test-win v6 ip in the range of start: 2409:8c20:1833:2000::afb:af58, end: 2409:8c20:1833:2000::afb:af58
I0117 11:06:20.863661       7 ipam.go:51] allocate v4 10.251.175.88 v6 2409:8c20:1833:2000::afb:af58 mac 00:00:00:EC:BF:95 for test-ycx/test-win from subnet vlan-net-2020
I0117 11:06:20.888458       7 subnet.go:223] allocate pod test-ycx/test-win v4 ip in the range of start: 10.222.30.4, end: 10.222.30.31
I0117 11:06:20.888492       7 ipam.go:51] allocate v4 10.222.30.4 v6  mac 00:00:00:16:5B:AD for test-ycx/test-win from subnet ovn-default
I0117 11:14:39.153342       7 ipam.go:85] allocate v4 10.251.175.88 v6 2409:8c20:1833:2000::afb:af58 mac 00:00:00:EC:BF:95 for test-ycx/test-win
I0117 11:14:51.987815       7 pod.go:839] release all ip address for deleting pod test-ycx/test-win
I0117 11:14:51.987827       7 pod.go:861] delete cr ip 'test-win.test-ycx.attachnet.default.ovn' for pod test-ycx/test-win
I0117 11:14:51.995430       7 pod.go:861] delete cr ip 'test-win.test-ycx' for pod test-ycx/test-win
I0117 11:14:51.997527       7 subnet.go:474] release v4 10.222.30.4 mac 00:00:00:16:5B:AD for test-ycx/test-win, add ip to released list
I0117 11:14:51.997545       7 subnet.go:474] release v4 10.251.175.88 mac 00:00:00:EC:BF:95 for test-ycx/test-win, add ip to released list
I0117 11:14:51.997557       7 subnet.go:504] release v6 2409:8c20:1833:2000::afb:af58 mac  for test-ycx/test-win, add ip to released list

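The test environment did not keep its logs, so the allocation and reclaim history of these particular IPs cannot be analyzed. However, the logs for a single VM's allocation and release show that both the allocation and the release of an individual IP were executed.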
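
When the logs are available, the per-IP history can be rebuilt from the controller log. The sketch below is an assumption-laden example: the two regular expressions are written only against the log lines quoted in this issue, and the helper name `ip_events` is hypothetical. Other kube-ovn versions may format these lines differently.

```python
import re
from collections import defaultdict

# Patterns written against the log lines quoted in this issue only.
ALLOC = re.compile(r"allocate v4 (\S*) v6 (\S*) mac \S+ for (\S+)")
RELEASE = re.compile(
    r"release (?:v4|v6) (\S+) mac \S* for ([^\s,]+), add ip to released list"
)

def ip_events(lines):
    """Map each IP to its ordered list of (event, pod) tuples."""
    events = defaultdict(list)
    for line in lines:
        alloc = ALLOC.search(line)
        release = RELEASE.search(line)
        if alloc:
            v4, v6, pod = alloc.groups()
            for ip in (v4, v6):
                if ip:  # the v6 field is empty on v4-only subnets
                    events[ip].append(("allocate", pod))
        elif release:
            ip, pod = release.groups()
            events[ip].append(("release", pod))
    return dict(events)

# Four lines copied from the log excerpt above.
sample = [
    "I0117 11:06:20.863661       7 ipam.go:51] allocate v4 10.251.175.88 v6 2409:8c20:1833:2000::afb:af58 mac 00:00:00:EC:BF:95 for test-ycx/test-win from subnet vlan-net-2020",
    "I0117 11:14:39.153342       7 ipam.go:85] allocate v4 10.251.175.88 v6 2409:8c20:1833:2000::afb:af58 mac 00:00:00:EC:BF:95 for test-ycx/test-win",
    "I0117 11:14:51.997545       7 subnet.go:474] release v4 10.251.175.88 mac 00:00:00:EC:BF:95 for test-ycx/test-win, add ip to released list",
    "I0117 11:14:51.997557       7 subnet.go:504] release v6 2409:8c20:1833:2000::afb:af58 mac  for test-ycx/test-win, add ip to released list",
]

events = ip_events(sample)
for ip, history in events.items():
    print(ip, [kind for kind, _ in history])
```

An IP whose history ends in `release` but that still cannot be allocated would be a candidate for the reclaim bug described here.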

During release, the result of merging the IP ranges should be logged.

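The current view is that deleting and reclaiming the VM IP itself works; the prime suspect is the computation that dynamically merges the available IP ranges.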
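
To make the suspected step concrete, here is a minimal sketch of a free-range list that merges on release and prints the merged result each time, as suggested above. It is not kube-ovn's implementation: the `FreeRanges` class and its methods are hypothetical, and the release of .74 is added only to demonstrate merging (in the issue .74 is still in use).

```python
import ipaddress

class FreeRanges:
    """Sorted, non-overlapping list of inclusive [start, end] integer ranges."""

    def __init__(self, ranges=()):
        self.ranges = [list(r) for r in sorted(ranges)]

    def allocate(self):
        """Take the lowest free address, or return None when exhausted."""
        if not self.ranges:
            return None
        first = self.ranges[0]
        ip = first[0]
        if first[0] == first[1]:
            self.ranges.pop(0)   # single-address range is now used up
        else:
            first[0] += 1        # shrink the range from the left
        return ip

    def release(self, ip):
        """Return ip to the free list, merging it with adjacent ranges."""
        self.ranges.append([ip, ip])
        self.ranges.sort()
        merged = [self.ranges[0]]
        for start, end in self.ranges[1:]:
            if start <= merged[-1][1] + 1:   # overlapping or adjacent
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        self.ranges = merged

    def __str__(self):
        return ",".join(
            f"{ipaddress.ip_address(s)}-{ipaddress.ip_address(e)}"
            for s, e in self.ranges
        )

base = int(ipaddress.ip_address("10.251.175.0"))
free = FreeRanges()  # start with every address allocated
for octet in (73, 65, 75, 70, 74):  # .74 is a hypothetical extra release
    free.release(base + octet)
    print(f"released .{octet}; free ranges now: {free}")
```

After these releases the list holds .65, .70, and the merged range .73 to .75, and `free.allocate()` hands out .65 first. If the controller's in-memory ranges diverge from this kind of result after a release, that would match the symptom that a restart (which rebuilds the ranges during init) fixes allocation.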


## Actual Behavior

## Steps to Reproduce the Problem

1. Use and delete Pods and VMs normally. Scattered IPs are released, but cannot be allocated again after release.

## Additional Info

- Kubernetes version:

  **Output of `kubectl version`:**

  ```bash
  (paste your output here)

  release-1.11
  ```

- operation-system/kernel version:

  **Output of `awk -F '=' '/PRETTY_NAME/ { print $2 }' /etc/os-release`:**
  **Output of `uname -r`:**

  ```bash
  (paste your output here)
  ```

bobz965 commented 7 months ago

image

image

bobz965 commented 7 months ago

image

image