spidernet-io / spiderpool

Underlay and RDMA network solution for Kubernetes, for bare metal, VM, and any public cloud
https://spidernet-io.github.io/spiderpool/
Apache License 2.0

Add a delayed IP recycling pool #3505

Closed · jiayoukun closed this issue 3 weeks ago

jiayoukun commented 4 months ago

What would you like to be added?

In a subnet with a sufficiently small IP pool, when using kubectl delete pod, Kubernetes deletes the Pod and simultaneously creates a new Pod with the same metadata.name as the previous one. Because Kubernetes scheduling is asynchronous, the IP of the recently deleted Pod may be immediately assigned to the newly created Pod. I believe this is unfriendly for developing IP-sensitive components such as OVS and VPP. I think a delayed IP recycling technique should be used: after an IP is released, it should be put into a delayed recycling pool and returned to the IP pool only after a certain waiting period. What do you think?
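
To illustrate the idea only (hypothetical names, not Spiderpool code), a minimal sketch of such a delayed recycling pool in Go could look like this:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// delayedPool is a hypothetical sketch of the proposed mechanism:
// released IPs are quarantined for a cooldown period before they
// become allocatable again. Nothing here is Spiderpool code.
type delayedPool struct {
	mu       sync.Mutex
	free     []string             // IPs that can be handed out immediately
	cooling  map[string]time.Time // released IP -> time it becomes free again
	cooldown time.Duration
}

func newDelayedPool(ips []string, cooldown time.Duration) *delayedPool {
	return &delayedPool{
		free:     append([]string(nil), ips...),
		cooling:  map[string]time.Time{},
		cooldown: cooldown,
	}
}

// Release puts an IP into the cooling area instead of the free list.
func (p *delayedPool) Release(ip string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.cooling[ip] = time.Now().Add(p.cooldown)
}

// Allocate first moves any IP whose cooldown has expired back to the free
// list, then pops one. It fails when only cooling IPs remain, which is the
// availability trade-off to weigh.
func (p *delayedPool) Allocate() (string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	now := time.Now()
	for ip, readyAt := range p.cooling {
		if now.After(readyAt) {
			p.free = append(p.free, ip)
			delete(p.cooling, ip)
		}
	}
	if len(p.free) == 0 {
		return "", fmt.Errorf("no allocatable IP: %d IPs still cooling down", len(p.cooling))
	}
	ip := p.free[len(p.free)-1]
	p.free = p.free[:len(p.free)-1]
	return ip, nil
}

func main() {
	pool := newDelayedPool([]string{"10.0.0.10"}, 30*time.Second)
	ip, _ := pool.Allocate()
	pool.Release(ip)
	if _, err := pool.Allocate(); err != nil {
		fmt.Println(err) // the just-released IP cannot be reused yet
	}
}
```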

Why is this needed?

No response

How to implement it (if possible)?

No response

Additional context

No response

cyclinder commented 4 months ago

Thanks for the report @jiayoukun, this needs an ack from @weizhoublue

weizhoublue commented 4 months ago

I think it is hard to tell. That sounds like an issue with OVS and VPP. But is that true? Does the same IP issue occur in Antrea and kube-ovn, which are based on OVS? In other words, is the problem about how OVS is used rather than how IP addresses are assigned?

There is a trade-off between the delay and IP address availability:

1. No matter how many IP addresses are available, a restarting stateful pod needs the same IP at once. The delay mechanism may keep the restarting pod from running for dozens of seconds, which is a disaster.
2. It is hard to define "sufficient IPs in a pool". Whenever lots of pods are created or restarted suddenly, the IP addresses become insufficient, which may cause many pods to fail to create on the first attempt.

So I want to figure out the exact scenario you experienced. Otherwise, it is a bad idea to risk failing to create pods owing to insufficient IP resources, and it seems to be a problem about how OVS and VPP are used.

Possibly, I presume, a fixed MAC address could help you? That would avoid having to update the ARP entries on the networking devices.

jiayoukun commented 4 months ago

@weizhoublue I apologize for not expressing the problem clearly earlier. I am referring to stateless Pods. Let me give an example: in the kubectl delete pod operation I mentioned above, the current Pod is deleted and a new one is created in parallel. When they are assigned the same IP, components like kube-proxy that need to deliver iptables rules keyed by IP and port may encounter issues (out-of-order add and del events, since both handle network rules for the same IP and port). This is because external components watch resources (Kind: Service, Endpoint) in the Kubernetes API Server and process events accordingly.

I have read the kube-ovn IPAM code. In kube-ovn, during pod creation the allocated IP is written into the Pod's annotations before the CNI is invoked, so the IP is bound to the Pod and does not hit the issue of being immediately reallocated after deletion.

My scenario is similar to kube-proxy: I watch resources in the Kubernetes API Server and deliver network rules keyed by IP and port, and this issue arises. kube-proxy is just an example, but in my scenario the problem does occur. I suspect kube-proxy does not hit it because the Kubernetes pod CIDR is sufficiently large, or because it takes recycle time into account when allocating IPs.

weizhoublue commented 4 months ago

That sounds interesting. What result would kube-proxy produce if that happened? According to my understanding, kube-proxy combines unique pod identity information to identify each rule, not just the IP and port; rules appear like `--comment "default/rdma-macvlan-ens6f1np1-v4:test"` in the iptables entries for each endpoint, so it should cope with this. Have you found any relevant issue in the kube-proxy community? I would appreciate the issue link; I want to dig into it.

BTW, as we know, underlay IPs are scarce for underlay CNIs like Antrea or kube-ovn, no matter how the ippool is specified or how the IPAM works. Suppose there are just 100 IP addresses in total and 100 pods running, and you restart any pod; I believe Antrea or kube-ovn would encounter the same issue, right? So the key question is: do they have any special delay mechanism for that in their IPAM and OVS components, if you know? Thanks, that may offer some best-practice advice.

jiayoukun commented 4 months ago

Regarding kube-proxy's rule delivery process, let me explain in more detail. The `--comment "default/rdma-macvlan-ens6f1np1-v4:test"` you mentioned has the format `namespace/serviceName:servicePortName`. When an endpoint changes, this content does not change; the comment only annotates the match used when the Service performs load balancing. The actual DNAT is done through the endpoint IP and port that follow.

For iptables, duplicate rules are allowed because matching conditions are evaluated one by one from top to bottom. In Kubernetes, if the above problem occurs, I suspect it only results in the duplicated rules shown below, lasting for just a fleeting moment.

*(screenshot: duplicated iptables rules for the same endpoint IP and port)*

The scenario I described above may occur only under fairly narrow conditions. Its preconditions are: the IP pool is small enough; the network rules do not allow duplication (e.g., routes, OVS flow tables); and after a Pod is deleted and a new one added in parallel, both are assigned the same IP.

weizhoublue commented 4 months ago

Yes, I see what you are saying. If the new endpoint belongs to a different service, it is added to a new iptables chain, and the out-of-order events do not matter. If the new endpoint belongs to the same service, do you mean kube-proxy may end up deleting the rule because of out-of-order events? I figure kube-proxy is able to handle this case. Have you found any kube-proxy issue describing an unexpected result or bug? I really need a detailed issue to prove this is a big deal. If not, we should not take the risk of introducing side effects where a new pod may fail to run at first boot. Also, underlay IP resources are limited, so kube-ovn or Antrea must face the same issue. Is there any specific bug or approach involved in OVS-based kube-ovn or Antrea?

I have no idea exactly what you are handling; maybe you are developing something OVS-related? So is it possible to combine the pod ID with the IP and port to identify your rules?

jiayoukun commented 4 months ago

In my opinion, because the network rules kube-proxy operates on are iptables, and iptables allows duplication, out-of-order events for the same IP (the add event completing first, followed by the delete event) only result in the duplicate rules shown in the screenshot above. This lasts just a fleeting moment, and once the delete event also completes, the iptables rules return to normal. However, many other kinds of network rules (e.g., routes, nat44 in VPP, flow tables in OVS) use a one-to-one key-value map underneath, which does not allow duplication. If an out-of-order sequence occurs (add followed by delete), it actually results in the loss of the network rule.

So what am I doing now? For each Pod event, I Get or List the SpiderIPPool resource to find the IP of the Pod that is changing, and compare the Pod UID with the allocatedIPs in its status, so that the network rules stay consistent when events arrive out of order.

That is my current method. Although it solves the problem, the coupling with the IPAM code is too high. The purpose of raising this issue is to see whether this problem can be resolved as much as possible within IPAM, or whether the coupling can be reduced as much as possible. I am not saying that adding a delayed IP reclaim mechanism is the best approach.
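
For reference, here is a rough sketch of that Get-and-compare check. The GVR, the layout of status.allocatedIPs (assumed to be a JSON string keyed by IP), and the `podUid` key reflect my reading and may not match the released Spiderpool CRD exactly:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// ipOwnedByPod checks whether an IP in a SpiderIPPool is still recorded
// against the expected pod UID. The GVR and the status.allocatedIPs layout
// used here are assumptions for illustration; check the CRD you run.
func ipOwnedByPod(ctx context.Context, dc dynamic.Interface, poolName, ip, podUID string) (bool, error) {
	gvr := schema.GroupVersionResource{
		Group:    "spiderpool.spidernet.io",
		Version:  "v2beta1",
		Resource: "spiderippools",
	}
	pool, err := dc.Resource(gvr).Get(ctx, poolName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	raw, found, err := unstructured.NestedString(pool.Object, "status", "allocatedIPs")
	if err != nil || !found {
		return false, err
	}
	// Assumed record layout: map of IP -> {"podUid": "..."}.
	var allocations map[string]struct {
		PodUID string `json:"podUid"`
	}
	if err := json.Unmarshal([]byte(raw), &allocations); err != nil {
		return false, err
	}
	rec, ok := allocations[ip]
	return ok && rec.PodUID == podUID, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dc := dynamic.NewForConfigOrDie(cfg)
	ok, err := ipOwnedByPod(context.Background(), dc, "default-v4-ippool", "10.6.0.21", "example-pod-uid")
	fmt.Println(ok, err)
}
```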

Regarding kube-ovn: in fact, many SDNs on the market have built their own IPAM, such as calico-ipam and kube-ovn's logical switch. They are usually only compatible with their own components, but as an open-source IPAM, Spiderpool should have a better approach, right?

Perhaps you have a better solution? We can discuss and communicate. Thank you!

weizhoublue commented 4 months ago

@jiayoukun You finally got to the key of the matter.

But I do not see it that way. OVS's port other_config allows writing custom fields to embed information like the pod UID, and an OVS flow rule can use the cookie field to embed the pod UID and establish a one-to-one correspondence with the pod. Based on this information beyond IP and port, you can achieve a one-to-one mapping between rules and events. When a pod is deleted, you just need to delete the OVS flows matching that cookie.

So the answer to how to apply OVS or VPP rules can be found in kube-ovn (OVS) and Calico (VPP).
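
To make the cookie idea concrete, here is a rough sketch that shells out to ovs-ofctl; the bridge name, match, and priority are placeholders, not what kube-ovn actually programs:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"os/exec"
)

// cookieForPod derives a 64-bit OpenFlow cookie from the pod UID so that
// every flow installed for a pod can later be matched (and deleted) by
// cookie alone, independent of IP reuse.
func cookieForPod(podUID string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(podUID))
	return h.Sum64()
}

// addPodFlow installs a flow tagged with the pod's cookie. Bridge name,
// priority, and match are illustrative placeholders.
func addPodFlow(bridge, podUID, podIP string) error {
	flow := fmt.Sprintf("cookie=0x%x,priority=100,ip,nw_dst=%s,actions=normal",
		cookieForPod(podUID), podIP)
	return exec.Command("ovs-ofctl", "add-flow", bridge, flow).Run()
}

// delPodFlows removes every flow carrying this pod's cookie; the /-1 mask
// asks ovs-ofctl for an exact cookie match.
func delPodFlows(bridge, podUID string) error {
	match := fmt.Sprintf("cookie=0x%x/-1", cookieForPod(podUID))
	return exec.Command("ovs-ofctl", "del-flows", bridge, match).Run()
}

func main() {
	const bridge, uid, ip = "br-int", "example-pod-uid", "10.6.0.21"
	if err := addPodFlow(bridge, uid, ip); err != nil {
		fmt.Println("add-flow failed:", err)
	}
	// On pod delete, out-of-order IP reuse no longer matters: the delete
	// keys on the cookie, not on the IP.
	if err := delPodFlows(bridge, uid); err != nil {
		fmt.Println("del-flows failed:", err)
	}
}
```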

So I think you are focusing on how to help components like OVS deliver rules right now, while I am thinking about whether this is the overall optimal solution for the future. I can already see that delayed IP reclaim brings other problems of its own; if those are not thought through, your component may not achieve the best result in the end.

I'm thinking about two things. First, why is a delayed IP reclaim feature not available in the IPAM of other CNI projects? I don't think it's simply because they are only compatible with their own components; I think it's because they follow best practice: the robustness of components other than IPAM needs to be ensured by those components themselves, and a delay in IPAM cannot perfectly solve this problem. Instead, it may have the side effect that business pods cannot start. Second, all CNIs face the problem of too many pods and IP exhaustion, so they are all exposed to this issue. Why does no project have a delayed IP reclaim function, and why have third-party kube-proxy-like components not raised relevant issues in those communities when adapting to these CNIs, for example asking components like Calico not to reallocate an IP within a short period of time? Is it a pseudo-requirement that does not conform to best practice? If so, then implementing delayed IP reclaim in Spiderpool would not be the right approach and could not truly help other CNIs solve the resulting side effects.

jiayoukun commented 4 months ago

You are correct. The purpose of raising this issue is to find a better solution, and the delayed IP reclaim method is just an example. After all, creating a universal feature for an open-source IPAM indeed requires deeper consideration.

In fact, to solve the current problem, it is enough for Spiderpool to ensure that the IP is reclaimed only after the Pod has truly been removed; that resolves the out-of-order issue. It's not that complicated. The problem occurs now because recycling the old container and creating the new one do not happen at the same time.

This is my understanding: when a Pod is deleted, it enters the Terminating state. In the CNI cmdDel call, ipam.ExecDel is invoked, and a successful return means the IP has been reclaimed. However, it may actually take some time for the container to be fully recycled. So the IP should be considered reclaimed only when the Pod has completely disappeared from Kubernetes, not merely when it enters the Terminating state. That way, the del event for the old Pod is guaranteed to be observed before the add event of the new Pod that reuses the same IP.
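
For illustration, a rough sketch of that ordering in Go (polling only to keep it short; the names are placeholders, and this is not how Spiderpool's GC actually works):

```go
package main

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// releaseIPAfterPodGone holds the IP until the Pod object has fully
// disappeared from the API server (or has been replaced by a new Pod with
// the same name but a different UID), then calls release. A real
// controller would watch instead of polling.
func releaseIPAfterPodGone(ctx context.Context, cs kubernetes.Interface,
	ns, name string, uid types.UID, release func()) error {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		switch {
		case apierrors.IsNotFound(err):
			// The Pod object is gone from the API server; release the IP now.
			release()
			return nil
		case err != nil:
			return err
		case pod.UID != uid:
			// Same name, new incarnation: the old Pod has been removed.
			release()
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	err = releaseIPAfterPodGone(context.Background(), cs,
		"default", "example-pod", types.UID("old-pod-uid"),
		func() { fmt.Println("IP can now go back to the pool") })
	fmt.Println(err)
}
```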

weizhoublue commented 4 months ago

I have already understood your proposal, but you keep expressing your ideas without considering my concerns. I don't think this is a good solution; the OVS component should ensure its own robustness, as kube-ovn or Calico does, because the proposal brings side effects such as the difficulty of controlling the timeout duration and the possibility of new pods failing to start for lack of IP addresses when IPs are suddenly in short supply. That is unacceptable in a production environment. This is why I keep wondering why this feature is not present in so many other CNI projects. If you can solve the problem of these side effects, it will help mature the solution.

jiayoukun commented 4 months ago

@weizhoublue I have read the code related to Calico and Kube-OVN in this regard.

Here is my basic understanding. For Calico: when the CNI plugin handles cmdAdd, it creates a WorkloadEndpoint object, and Calico's Felix component watches events on WorkloadEndpoint objects to program routing. The watched object is the WorkloadEndpoint, not the Pod object in the Kubernetes API Server. The reason there is no out-of-order issue is that when the CNI plugin handles cmdDel, the delete is executed directly on the WorkloadEndpoint object, so Felix can act on that delete event without waiting for the Pod resource's deletion event after the container is recycled, which resolves the ordering problem.

Regarding Kube-OVN: it watches multiple resource objects such as Pod, Node, and NetworkPolicy, and eventually writes them into the OVN northbound/southbound DB. It relies on the DB's transaction mechanism to keep out-of-order IP events consistent.

Therefore, the two components take these approaches, respectively:

1. Trigger events on their own resource objects at CNI plugin time (cmdAdd triggers the add event, cmdDel triggers the del event) and drive rule operations from those.
2. Use external database transactions to ensure data consistency across events.

This is my understanding, for reference only. Happy to discuss.

weizhoublue commented 3 months ago

Yes, I think it's fundamental that every component ensures its own robustness. A stateless application's ability to handle events in any order is fundamental, and it shouldn't have to rely on the ordering guarantees of another component to paper over its own flaws, such as what you describe as delayed IP reclaim. I believe that even with an IP reclaim delay, you are bound to encounter out-of-order events, for example when your component processes a large backlog of accumulated events after restarting, or when the network is unstable.

jiayoukun commented 3 months ago

@weizhoublue So, do you have any good solution? As an open-source IPAM component, shouldn't it provide more convenient interfaces for components that operate on network rules? Otherwise, developers have to handle event disorder themselves in the CNI layer, which is not friendly for users and enterprises adopting an open-source CNI. Currently I use a solution where spiderpool client code is coupled into the network-rules component to compare against IPPools; clearly, this is not a very flexible solution.

weizhoublue commented 3 months ago

I mean that even with an IP reclaim delay, you are still bound to encounter out-of-order events, for example when your component processes a large backlog of accumulated events after restarting, or when the network is unstable and all the events arrive suddenly in bulk. This work-around will not fix your problem completely.

jiayoukun commented 3 months ago

@weizhoublue I understand your point. So, a standalone IPAM does not need to support this feature.

As far as I know, the current simple IPAM approach leverages the flexibility of Kubernetes Pod states for IP allocation: when a Pod is deleted, it enters the Terminating state. From the user's point of view the Pod is already deleted, but from the cluster's point of view the IP release would ideally happen after the Kubernetes API Server delete event (currently it is released during the CNI cmdDel call, so the IP release happens before the API Server delete event). For the user, though, this difference is invisible.

Moreover, for Kubernetes clusters with IP-sensitive components like kube-proxy, iptables tolerates duplicate IP rules (momentary duplicates caused by out-of-order events do not affect rule matching or addition/deletion).

*(screenshot: duplicated iptables rules, as above)*

Therefore, for network rules that depend on event ordering, such as the routes programmed by Calico's Felix or the OpenFlow flow tables programmed by kube-ovn, the ordering between IP release and the Kubernetes API Server delete event needs to be guaranteed.

So, is this why many mature SDNs implement their own IP allocation modules: because they need to control the ordering of these events?

So, is Spiderpool not planning to support this feature?

weizhoublue commented 3 weeks ago

As discussed above, I still think that way, so I am closing the issue.