kubeovn / kube-ovn

A Bridge between SDN and Cloud Native (Project under CNCF)
https://kubeovn.github.io/docs/stable/en/
Apache License 2.0

[BUG] need to handle failures while allocating multiple IPs for a single pod, or it will exhaust the whole IP pool. #4210

Closed: jcshare closed this issue 3 days ago

jcshare commented 1 week ago

Kube-OVN Version

v1.12.7 and master

Kubernetes Version

v1.27

Operation-system/Kernel Version

"Ubuntu 20.04.6 LTS" / 5.4.0-186-generic

Description

It looks like there is no handling for IP allocation failures during creation of the VPC gateway pod, and the whole external IP pool gets exhausted by this problem.

After doing some research, I found that the root cause looks like the following:

pod.go
// do the same thing as add pod
func (c *Controller) reconcileAllocateSubnets(cachedPod, pod *v1.Pod, needAllocatePodNets []*kubeovnNet) (*v1.Pod, error) {
    namespace := pod.Namespace
    name := pod.Name
    klog.Infof("sync pod %s/%s allocated", namespace, name)

    isVMPod, vmName := isVMPod(pod)
    podType := getPodType(pod)
    podName := c.getNameByPod(pod)
    // todo: isVmPod, getPodType, getNameByPod has duplicated logic

    // Avoid create lsp for already running pod in ovn-nb when controller restart
    for _, podNet := range needAllocatePodNets {
        // the subnet may changed when alloc static ip from the latter subnet after ns supports multi subnets
        v4IP, v6IP, mac, subnet, err := c.acquireAddress(pod, podNet)
        if err != nil {
            c.recorder.Eventf(pod, v1.EventTypeWarning, "AcquireAddressFailed", err.Error())
            klog.Error(err)
            return nil, err // <<<<<<<< here: need to release the IPs already allocated in previous iterations of this loop
        }
        // ...
    }
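For illustration, here is a minimal sketch of the shape the fix could take: track which networks have already been allocated inside this loop and roll them back before returning the error. This is not kube-ovn's actual code; releaseAllocated is a hypothetical stand-in for whatever release path the real IPAM exposes.

    // Sketch only: make the multi-network allocation transactional.
    allocated := make([]*kubeovnNet, 0, len(needAllocatePodNets))
    for _, podNet := range needAllocatePodNets {
        v4IP, v6IP, mac, subnet, err := c.acquireAddress(pod, podNet)
        if err != nil {
            c.recorder.Eventf(pod, v1.EventTypeWarning, "AcquireAddressFailed", err.Error())
            klog.Error(err)
            // Roll back the addresses acquired in earlier iterations so a
            // failing pod does not leak one external IP per retry.
            releaseAllocated(c, pod, allocated) // hypothetical helper
            return nil, err
        }
        allocated = append(allocated, podNet)
        // ... continue with annotations / LSP creation using v4IP, v6IP, mac, subnet
    }

With this shape, a requeued reconcile would start from a clean IPAM state instead of compounding the leak on every retry.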

A VPC gateway pod's info:

root@master:~# kubectl describe pod vpc-nat-gw-tanant1-pro1-vpc1-net1-gw-0 -n kube-system
Name:             vpc-nat-gw-tanant1-pro1-vpc1-net1-gw-0
Namespace:        kube-system
Priority:         0
Service Account:  default
Node:             worker2/192.168.1.118
Start Time:       Fri, 21 Jun 2024 11:13:32 +0000
Labels:           app=vpc-nat-gw-tanant1-pro1-vpc1-net1-gw
                  controller-revision-hash=vpc-nat-gw-tanant1-pro1-vpc1-net1-gw-658dfcff4
                  ovn.kubernetes.io/vpc-nat-gw=true
                  statefulset.kubernetes.io/pod-name=vpc-nat-gw-tanant1-pro1-vpc1-net1-gw-0
Annotations:      k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "kube-ovn",
                        "interface": "eth0",
                        "ips": [
                            "172.22.1.254"
                        ],
                        "mac": "00:00:00:16:07:58",
                        "default": true,
                        "dns": {},
                        "gateway": [
                            "172.22.1.1"
                        ]
                    },{
                        "name": "kube-system/ovn-vpc-external-network",
                        "interface": "net1",
                        "ips": [
                            "192.168.1.19"
                        ],
                        "mac": "02:26:74:5d:03:a3",
                        "dns": {}
                    }]

Steps To Reproduce

Create and delete a VPC NAT gateway multiple times.

Current Behavior

The external IP CIDR gets exhausted by this problem: every failed retry leaks another external address.

Expected Behavior

Graceful handling of IP allocation/release, so that failed attempts do not leak addresses.

jcshare commented 1 week ago

Could some expert help fix this issue, as I'm not in a position to do it myself? Many thanks.

bobz965 commented 1 week ago

Please attach the error log from the kube-ovn-controller pod about the NAT gateway pod's IP allocation.

jcshare commented 1 week ago

701 I0619 12:07:04.058569 6 ipam.go:60] allocate v4 192.168.1.10, v6 , mac for kube-system/vpc-nat-gw-gw1-vpc-1-0 from subnet ovn-vpc-external-network
702 I0619 12:07:04.071551 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet net1-vpc-1
703 E0619 12:07:04.072121 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet net1-vpc-1, err NoAvailableAddress
704 I0619 12:07:04.072830 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet ovn-default
705 E0619 12:07:04.073320 6 ipam.go:89] failed to allocate static ip 10.0.1.254 for kube-system/vpc-nat-gw-gw1-vpc-1-0
706 E0619 12:07:04.073525 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet ovn-default, err AddressOutOfRange
707 E0619 12:07:04.073788 6 pod.go:620] AddressOutOfRange
708 E0619 12:07:04.074250 6 pod.go:405] error syncing 'kube-system/vpc-nat-gw-gw1-vpc-1-0': AddressOutOfRange, requeuing
709 I0619 12:07:04.074177 6 event.go:298] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"vpc-nat-gw-gw1-vpc-1-0", UID:"d61e58b5-c8f8-4f79-86cc-4e6d8724f475", APIVersion:"v1", ResourceVersion:"2632", FieldPath:""}): type: 'Warning' reason: 'AcquireAddressFailed' AddressOutOfRange
710 I0619 12:07:04.080417 6 pod.go:550] handle add/update pod kube-system/vpc-nat-gw-gw1-vpc-1-0
711 I0619 12:07:04.083914 6 pod.go:346] enqueue update pod kube-system/vpc-nat-gw-gw1-vpc-1-0
712 I0619 12:07:04.086506 6 pod.go:607] sync pod kube-system/vpc-nat-gw-gw1-vpc-1-0 allocated
713 I0619 12:07:04.087707 6 ipam.go:60] allocate v4 192.168.1.11, v6 , mac for kube-system/vpc-nat-gw-gw1-vpc-1-0 from subnet ovn-vpc-external-network
714 I0619 12:07:04.097194 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet net1-vpc-1
715 E0619 12:07:04.097443 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet net1-vpc-1, err NoAvailableAddress
716 I0619 12:07:04.097556 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet ovn-default
717 E0619 12:07:04.097627 6 ipam.go:89] failed to allocate static ip 10.0.1.254 for kube-system/vpc-nat-gw-gw1-vpc-1-0
718 E0619 12:07:04.097700 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet ovn-default, err AddressOutOfRange
719 E0619 12:07:04.097851 6 pod.go:620] AddressOutOfRange
720 E0619 12:07:04.097949 6 pod.go:405] error syncing 'kube-system/vpc-nat-gw-gw1-vpc-1-0': AddressOutOfRange, requeuing
721 I0619 12:07:04.098040 6 pod.go:550] handle add/update pod kube-system/vpc-nat-gw-gw1-vpc-1-0
722 I0619 12:07:04.098105 6 event.go:298] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"vpc-nat-gw-gw1-vpc-1-0", UID:"d61e58b5-c8f8-4f79-86cc-4e6d8724f475", APIVersion:"v1", ResourceVersion:"2633", FieldPath:""}): type: 'Warning' reason: 'AcquireAddressFailed' AddressOutOfRange
723 I0619 12:07:04.101424 6 pod.go:607] sync pod kube-system/vpc-nat-gw-gw1-vpc-1-0 allocated
724 I0619 12:07:04.101533 6 ipam.go:60] allocate v4 192.168.1.12, v6 , mac for kube-system/vpc-nat-gw-gw1-vpc-1-0 from subnet ovn-vpc-external-network
725 I0619 12:07:04.107169 6 pod.go:346] enqueue update pod kube-system/vpc-nat-gw-gw1-vpc-1-0
726 I0619 12:07:04.107456 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet net1-vpc-1
727 E0619 12:07:04.107559 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet net1-vpc-1, err NoAvailableAddress
728 I0619 12:07:04.107574 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet ovn-default
729 E0619 12:07:04.107581 6 ipam.go:89] failed to allocate static ip 10.0.1.254 for kube-system/vpc-nat-gw-gw1-vpc-1-0
730 E0619 12:07:04.107585 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet ovn-default, err AddressOutOfRange
731 E0619 12:07:04.107645 6 pod.go:620] AddressOutOfRange
732 E0619 12:07:04.108065 6 pod.go:405] error syncing 'kube-system/vpc-nat-gw-gw1-vpc-1-0': AddressOutOfRange, requeuing
733 I0619 12:07:04.108083 6 pod.go:550] handle add/update pod kube-system/vpc-nat-gw-gw1-vpc-1-0
734 I0619 12:07:04.107846 6 event.go:298] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"vpc-nat-gw-gw1-vpc-1-0", UID:"d61e58b5-c8f8-4f79-86cc-4e6d8724f475", APIVersion:"v1", ResourceVersion:"2642", FieldPath:""}): type: 'Warning' reason: 'AcquireAddressFailed' AddressOutOfRange
735 I0619 12:07:04.110791 6 pod.go:607] sync pod kube-system/vpc-nat-gw-gw1-vpc-1-0 allocated
736 I0619 12:07:04.110947 6 ipam.go:60] allocate v4 192.168.1.13, v6 , mac for kube-system/vpc-nat-gw-gw1-vpc-1-0 from subnet ovn-vpc-external-network
737 I0619 12:07:04.116781 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet net1-vpc-1
738 E0619 12:07:04.116801 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet net1-vpc-1, err NoAvailableAddress
739 I0619 12:07:04.116809 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet ovn-default
740 E0619 12:07:04.116901 6 ipam.go:89] failed to allocate static ip 10.0.1.254 for kube-system/vpc-nat-gw-gw1-vpc-1-0
741 E0619 12:07:04.116916 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet ovn-default, err AddressOutOfRange
742 E0619 12:07:04.117040 6 pod.go:620] AddressOutOfRange
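The pattern above repeats within milliseconds: each reconcile successfully takes a fresh address from ovn-vpc-external-network (192.168.1.10, .11, .12, .13, ...), then fails on the static 10.0.1.254, returns the error, and is requeued without releasing the address it already took. Since the excludeIps on that subnet leave only 192.168.1.10..19 usable, a handful of retries is enough to drain the pool. A self-contained toy model of the leak (not kube-ovn code; all names here are illustrative) is sketched below:

package main

import (
	"errors"
	"fmt"
)

// Toy model (not kube-ovn code) of the leak visible in the log above: every
// reconcile takes a fresh external IP, then fails on the LAN IP and returns
// without releasing the external IP it already holds.
type pool struct {
	free []string
}

func (p *pool) allocate() (string, error) {
	if len(p.free) == 0 {
		return "", errors.New("NoAvailableAddress")
	}
	ip := p.free[0]
	p.free = p.free[1:]
	return ip, nil
}

func reconcile(external *pool) error {
	ip, err := external.allocate() // first network: succeeds, consumes the pool
	if err != nil {
		return err
	}
	fmt.Println("allocated external IP", ip)
	// Second network: the static LAN IP allocation fails (AddressOutOfRange
	// in the real log); the external IP above is never released.
	return errors.New("AddressOutOfRange")
}

func main() {
	external := &pool{free: []string{
		"192.168.1.10", "192.168.1.11", "192.168.1.12", "192.168.1.13",
	}}
	for i := 0; i < 5; i++ { // the controller requeues and retries
		if err := reconcile(external); err != nil {
			fmt.Println("reconcile failed:", err)
		}
	}
	fmt.Println("external IPs left:", len(external.free)) // 0: pool drained with no pod running
}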

bobz965 commented 1 week ago

err: failed to get static ip 10.0.1.254, mac , subnet ovn-default, err AddressOutOfRange
719 E0619 12:07:04.097851 6 pod.go:620] AddressOutOfRange

Please attach the kubectl get subnet details:

jcshare commented 1 week ago

The root cause should be as identified above; we need to handle the failure gracefully. I have rebuilt my setup, so I'm pasting the subnet and VPC definitions below:

kind: Vpc
apiVersion: kubeovn.io/v1
metadata:
  name: vpc-1
spec:
  staticRoutes:
    - cidr: 0.0.0.0/0
      nextHopIP: 10.0.1.254
      policy: policyDst
  namespaces:
    - ns1
---
kind: Subnet
apiVersion: kubeovn.io/v1
metadata:
  name: net1-vpc-1
spec:
  vpc: vpc-1
  cidrBlock: 10.0.1.0/24
  protocol: IPv4
  excludeIps:
    - 10.0.1.254
  namespaces:
    - ns1
---
kind: VpcNatGateway
apiVersion: kubeovn.io/v1
metadata:
  name: gw-vpc-1
spec:
  vpc: vpc-1
  subnet: net1-vpc-1
  lanIp: 10.0.1.254
  selector:
    - "kubernetes.io/hostname: worker2"
    - "kubernetes.io/os: linux"
  externalSubnets:
    - ovn-vpc-external-network

ubuntu@master:~/project/debug/1.12.7/test$ kubectl get subnet
NAME                       PROVIDER                               VPC                 PROTOCOL   CIDR             PRIVATE   NAT     DEFAULT   GATEWAYTYPE   V4USED   V4AVAILABLE   V6USED   V6AVAILABLE   EXCLUDEIPS                                                   U2OINTERCONNECTIONIP
join                       ovn                                    ovn-cluster         IPv4       100.64.0.0/16    false     false   false     distributed   3        65530         0        0             ["100.64.0.1"]
ovn-default                ovn                                    ovn-cluster         IPv4       10.16.0.0/16     false     true    true      distributed   5        65528         0        0             ["10.16.0.1"]
ovn-vpc-external-network   ovn-vpc-external-network.kube-system                       IPv4       192.168.1.0/24   false     false   false     distributed   3        7             0        0             ["192.168.1.1..192.168.1.9","192.168.1.20..192.168.1.255"]
ubuntu@master:~/project/debug/1.12.7/test$

jcshare commented 1 week ago

Per the log above, it looks like there is another problem (as you mentioned): the controller shouldn't try to allocate 10.0.1.254 from the ovn-default subnet at all.
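As a sketch of the guard being suggested (illustrative only; subnetContains is not kube-ovn's actual API): before attempting a static allocation, the controller could check that the requested IP is inside the candidate subnet's CIDR, so a request for 10.0.1.254 is never tried against ovn-default (10.16.0.0/16).

package main

import (
	"fmt"
	"net"
)

// subnetContains reports whether a requested static IP falls inside a
// subnet's CIDR block. Illustrative only; the names here are hypothetical.
func subnetContains(cidrBlock, ip string) bool {
	_, ipNet, err := net.ParseCIDR(cidrBlock)
	if err != nil {
		return false
	}
	parsed := net.ParseIP(ip)
	return parsed != nil && ipNet.Contains(parsed)
}

func main() {
	// 10.0.1.254 belongs to net1-vpc-1 (10.0.1.0/24), not to ovn-default
	// (10.16.0.0/16), so the fallback attempt should be skipped entirely.
	fmt.Println(subnetContains("10.0.1.0/24", "10.0.1.254"))  // true
	fmt.Println(subnetContains("10.16.0.0/16", "10.0.1.254")) // false
}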

bobz965 commented 1 week ago

where is your 10.0.1.0/24 subnet ???

jcshare commented 1 week ago

where is your 10.0.1.0/24 subnet ???

Could you take a deep look at the problem? It should be easy to reproduce with my configuration above. My testbed got broken by the problem and I have rebuilt it, so you cannot see the subnet in my current setup.

many thanks

jcshare commented 1 week ago

Anyway, I will reproduce it and upload all the log files later, thanks.

jcshare commented 1 week ago

I have reproduced it with a new VPC named "vpc-3"; the related log files are attached. Could you help take a look? Many thanks. 1.12.7-IP-Allocation-bug.zip

bobz965 commented 1 week ago

where is your 10.0.1.0/24 subnet ???

Sorry, I think when you run kubectl get subnet it shows all the subnets, but I do not find the 10.0.1.0/24 subnet.

[screenshot]

jcshare commented 1 week ago

where is your 10.0.1.0/24 subnet ???

Sorry, I think when you run kubectl get subnet it shows all the subnets, but I do not find the 10.0.1.0/24 subnet.

[screenshot]

could you refer to my reply above : https://github.com/kubeovn/kube-ovn/issues/4210#issuecomment-2185711409

jcshare commented 1 week ago

It looks like the problem is obvious. Could you help fix it if possible? Many thanks.

bobz965 commented 1 week ago

You do not have the VPC subnet 10.0.1.0/24; if you use 10.0.1.254, you should create it. If you use the vpc3 subnet, I think you should use 10.0.3.254.

[screenshot]

jcshare commented 1 week ago

"if you use vpc3 subnet, I think you should use 10.0.3.254."

Yes, I'm using 10.0.3.254 for vpc3. Please refer to the vpc3-related (rather than vpc1) configuration/debug info in the tarball; the information you mentioned was the stale configuration of vpc1 (which should be another issue that needs to be handled).

thanks

bobz965 commented 1 week ago

[screenshot]

[screenshot]

Hi @zhangzujian, it seems IPAM has a problem?