k8snetworkplumbingwg / whereabouts

A CNI IPAM plugin that assigns IP addresses cluster-wide
Apache License 2.0

Rebases #180: ip-reconciler: Use ContainerID instead of PodRef #399

Open · dougbtv opened 6 months ago

dougbtv commented 6 months ago

This is a (somewhat manual) rebase of #180 --

It looks like it's working in my testing, but... when I first tried it, it would intermittently delete incorrect keys and remove the wrong IPs.

So I was getting failures like:

Whereabouts IP reconciler
/home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/reconciler/ip_test.go:34
  reconciling an IP pool with multiple pods attached
  /home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/reconciler/ip_test.go:98
    each with IP from the same IPPool
    /home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/reconciler/ip_test.go:122
      reconciling the IPPool
      /home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/reconciler/ip_test.go:139
        should report the dead pod's IP address as deleted [It]
        /home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/reconciler/ip_test.go:146

        Expected
            <[]net.IP | len:1, cap:1>: [
                [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 255, 255, 10, 10, 10, 2],
            ]
        to equal
            <[]net.IP | len:1, cap:1>: [
                [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 255, 255, 10, 10, 10, 1],
            ]

        /home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/reconciler/ip_test.go:150

However, with whatever finagling I was doing, those have gone away... Or so I think.

So I've been testing it in a loop to repeat it a bunch of times, like:

#!/bin/bash

for i in {1..20}; do
    echo "Running test-go.sh - Attempt $i"
    ./hack/test-go.sh > /tmp/test_output_$i.txt 2>&1
    if [ $? -ne 0 ]; then
        echo "Test failed on attempt $i, output saved to /tmp/test_output_$i.txt"
    else
        rm /tmp/test_output_$i.txt
        echo "Test passed on attempt $i"
    fi
done

s1061123 commented 6 months ago

I understand that the PR changes whereabouts so that the identifier of an allocation is the container ID instead of the podRef (Pod namespace/name). So how about also changing the reconciler and the other code that matches Pods against the allocation pool, i.e. garbageCollectPodIPs() in pkg/controlloop/pod.go?
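
For illustration only, here is a minimal, self-contained sketch of the identifier change being described, filtering a pool's allocations by container ID instead of by podRef; the allocation type and the keysForContainer helper are assumptions made up for this example, not the whereabouts API:

package main

import "fmt"

// Hypothetical, simplified allocation record (not the whereabouts type); the
// field names mirror the ones visible in the test output above.
type allocation struct {
    ContainerID string
    PodRef      string // "namespace/name"
}

// keysForContainer returns the map keys of every allocation owned by the given
// container ID, the identifier this PR proposes to reconcile on, instead of
// matching on PodRef alone (which is ambiguous when a pod is recreated with
// the same namespace/name).
func keysForContainer(allocations map[string]allocation, containerID string) []string {
    var keys []string
    for key, alloc := range allocations {
        if alloc.ContainerID == containerID {
            keys = append(keys, key)
        }
    }
    return keys
}

func main() {
    pool := map[string]allocation{
        "1": {ContainerID: "aaa", PodRef: "default/pod1"}, // stale entry from a previous pod with the same name
        "2": {ContainerID: "bbb", PodRef: "default/pod1"},
    }
    fmt.Println(keysForContainer(pool, "bbb")) // [2]
}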

dougbtv commented 6 months ago

Tomo, not sure I'm following... This is to change the identifier during reconciliation. Do you mean to change the identifier to pod and allocation pool? I'm probably being dense.

dougbtv commented 6 months ago

I actually ran this 50 times and got 3 failures...

One I believe was a flake with a timeout starting the network controller, and then these two errors:

------------------------------
• Failure [0.001 seconds]
Whereabouts IP reconciler
/home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/reconciler/ip_test.go:37
  reconciling an IP pool with multiple pods attached
  /home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/reconciler/ip_test.go:101
    each with IP from the same IPPool
    /home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/reconciler/ip_test.go:125
      reconciling the IPPool
      /home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/reconciler/ip_test.go:142
        the IPPool should have only the IP reservation of the live pod [It]
        /home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/reconciler/ip_test.go:157

        Mismatch at key: 1
        Expected
            <string>: default/pod1
        to equal
            <string>: 

        /home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/reconciler/ip_test.go:172

and

------------------------------
• Failure [1.003 seconds]
IPControlLoop
/home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/controlloop/pod_controller_test.go:37
  a running pod on a node
  /home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/controlloop/pod_controller_test.go:61
    IPPool featuring an allocation for the pod
    /home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/controlloop/pod_controller_test.go:116
      the network attachment is available
      /home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/controlloop/pod_controller_test.go:127
        when the associated pod is deleted
        /home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/controlloop/pod_controller_test.go:160
          stale IP addresses are garbage collected [It]
          /home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/controlloop/pod_controller_test.go:165

          Timed out after 1.001s.
          the ip control loop should have removed this stale address
          Expected
              <map[string]v1alpha1.IPAllocation | len:1>: {
                  "0": {
                      ContainerID: "",
                      PodRef: "default/tiny-winy-pod",
                  },
              }
          to be empty

          /home/doug/codebase/src/github.com/dougbtv/whereabouts/pkg/controlloop/pod_controller_test.go:170

s1061123 commented 6 months ago

This is to change the identifier during reconciliation. Do you mean to change the identifier to pod and allocation pool? I'm probably being dense.

Should it be changed?

https://github.com/k8snetworkplumbingwg/whereabouts/blob/0d7d2be59005359f33ec2d67b0ab004f9c59f291/pkg/controlloop/pod.go#L225

dougbtv commented 6 months ago

So, in a discussion with Tomo, he demonstrated that under CRI-O the container ID doesn't match the pod sandbox ID, so this won't work under all runtimes. We'll need a different approach to get a unique ID to delete on that works for all container runtimes.

zshi-redhat commented 5 months ago

So, in a discussion with Tomo, he demonstrated that under CRI-O the container ID doesn't match the pod sandbox ID, so this won't work under all runtimes. We'll need a different approach to get a unique ID to delete on that works for all container runtimes.

I discussed this with Tomo; it seems the containerID and the sandboxID are the same on OpenShift with CRI-O (I tested it on 4.14 with this PR applied).

s1061123 commented 5 months ago

I discussed this with Tomo; it seems the containerID and the sandboxID are the same on OpenShift with CRI-O (I tested it on 4.14 with this PR applied).

Let me correct that: ContainerID and sandboxID are different in CRI-O/OpenShift, but @zshi-redhat tested and verified that the PR fixes the issue.

mlguerrero12 commented 5 months ago

Hi all, I had a look at this and I think this fix is not complete. First of all, the issue happens when, within a pool, there are 2 entries (different IPs) for the same pod reference. When the ip-control-loop runs, the method findOrphanedIPsPerPool can return different values depending on the order of the entries (ipReservations) of an allocation and on the order of the listed pods (listing pods is not ordered). So the function isPodAlive within findOrphanedIPsPerPool will fail for both entries if the pod with unknown status is listed first. I fixed this with some small changes in isPodAlive:

func (rl ReconcileLooper) isPodAlive(podRef string, ip string) bool {
    for livePodRef, livePod := range rl.liveWhereaboutsPods {
        if podRef == livePodRef {
            isFound := isIpOnPod(&livePod, podRef, ip)
            if isFound {
                return true
            } else if !isFound && (livePod.phase == v1.PodPending) {
                /* Sometimes pods are still coming up and may not have the Multus
                 * annotation added yet. We don't want to check the IPs yet, so
                 * re-fetch the Pod up to storage.PodRefreshRetries times.
                 */
                podToMatch := &livePod
                retries := 0

                logging.Debugf("Re-fetching Pending Pod: %s IP-to-match: %s", livePodRef, ip)

                for retries < storage.PodRefreshRetries {
                    retries += 1
                    podToMatch = rl.refreshPod(livePodRef)
                    if podToMatch == nil {
                        logging.Debugf("Cleaning up...")
                        return false
                    } else if podToMatch.phase != v1.PodPending {
                        logging.Debugf("Pending Pod is now in phase: %s", podToMatch.phase)
                        return isIpOnPod(podToMatch, podRef, ip)
                    } else {
                        isFound = isIpOnPod(podToMatch, podRef, ip)
                        // Short-circuit - Pending Pod may have IP now
                        if isFound {
                            logging.Debugf("Pod now has IP annotation while in Pending")
                            return true
                        }
                        time.Sleep(time.Duration(500) * time.Millisecond)
                    }
                }
            }
        }
    }
    return false
}

Regarding the use of the container ID, I find it a bit odd that whoever wrote ReconcileIPPools used podRef instead of ContainerID, given that this function calls IterateForDeallocation, which is also used by the pod controller at the moment of deallocating; in fact, the argument of that function is explicitly named containerID. Perhaps there is a good reason for this. In any case, given that some runtimes do not have a 1-to-1 correspondence between container ID and sandbox ID, we could use the IP (together with the podRef) to identify an allocation.

The allocations are stored in a map whose key is the offset of the reserved IP with respect to the network range, and this offset is used to calculate the IP of the IPReservation (done in toIPReservationList); a small sketch of this offset-to-IP relationship follows the function below. So, I modified ReconcileIPPools based on this. Note that I didn't modify IterateForDeallocation so as not to affect the pod controller, but perhaps we should; it makes sense there as well.

func (rl ReconcileLooper) ReconcileIPPools(ctx context.Context) ([]net.IP, error) {
    matchByPodRefAndIp := func(reservations []types.IPReservation, podRef string, ip net.IP) int {
        foundidx := -1
        for idx, v := range reservations {
            if v.PodRef == podRef && v.IP.String() == ip.String() {
                return idx
            }
        }
        return foundidx
    }

    var totalCleanedUpIps []net.IP

    for _, orphanedIP := range rl.orphanedIPs {
        currentIPReservations := orphanedIP.Pool.Allocations()
        for _, allocation := range orphanedIP.Allocations {
            pos := matchByPodRefAndIp(currentIPReservations, allocation.PodRef, allocation.IP)
            if pos < 0 {
                // Should never happen
                return nil, fmt.Errorf("did not find reserved IP for container %v", allocation.PodRef)
            }

            // Delete the entry via swap-remove (the order of the reservation list is not preserved)
            currentIPReservations[pos] = currentIPReservations[len(currentIPReservations)-1]
            currentIPReservations = currentIPReservations[:len(currentIPReservations)-1]

            logging.Debugf("Going to update the reserve list to: %+v", currentIPReservations)
            if err := orphanedIP.Pool.Update(ctx, currentIPReservations); err != nil {
                return nil, logging.Errorf("failed to update the reservation list: %v", err)
            }
            totalCleanedUpIps = append(totalCleanedUpIps, allocation.IP)
        }
    }

    return totalCleanedUpIps, nil
}
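
As a side note on the offset-to-IP relationship mentioned above, here is a minimal, self-contained sketch; the ipAtOffset helper and the 10.10.10.0/24 range are just for illustration and are not part of whereabouts:

package main

import (
    "fmt"
    "net"
)

// ipAtOffset returns the IPv4 address that sits "offset" addresses above the
// start of the given network, mirroring the idea that an allocation map key
// ("1", "2", ...) identifies a reservation by its position within the range.
// In this sketch the offset is counted from the network address.
func ipAtOffset(cidr string, offset uint32) (net.IP, error) {
    _, ipNet, err := net.ParseCIDR(cidr)
    if err != nil {
        return nil, err
    }
    ip := ipNet.IP.To4()
    if ip == nil {
        return nil, fmt.Errorf("this sketch only handles IPv4")
    }
    // Convert the base address to an integer, add the offset, convert back.
    base := uint32(ip[0])<<24 | uint32(ip[1])<<16 | uint32(ip[2])<<8 | uint32(ip[3])
    base += offset
    return net.IPv4(byte(base>>24), byte(base>>16), byte(base>>8), byte(base)), nil
}

func main() {
    // Allocation key "2" in a 10.10.10.0/24 pool corresponds to 10.10.10.2,
    // which is the address the test below expects to see cleaned up.
    ip, _ := ipAtOffset("10.10.10.0/24", 2)
    fmt.Println(ip) // 10.10.10.2
}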

I tested all of this in my setup and wrote a unit test as well. Disclaimer: the fake k8s client does not support having 2 pods with the same name; only the last one is stored. Other scenarios were tested by manually changing the order.

    Context("reconciling an IP pool with entries from the same pod reference", func() {
        var wbClient wbclient.Interface

        var podRunning *v1.Pod
        var podTerminated *v1.Pod

        It("verifies that the entry for the pod in unknown state is the only one deleted", func() {
            By("creating 2 pods with the same name and namespace. 1st pod without network-status annotation and in an unknown state")
            podTerminated = generatePod(namespace, podName, ipInNetwork{})
            podTerminated.Status.Phase = v1.PodUnknown
            podTerminated.ObjectMeta.ResourceVersion = "10"
            k8sClientSet = fakek8sclient.NewSimpleClientset(podTerminated)

            podRunning = generatePod(namespace, podName, ipInNetwork{ip: firstIPInRange, networkName: networkName})
            // This replaces the previous fake clientset; only the running pod is visible to the reconciler.
            k8sClientSet = fakek8sclient.NewSimpleClientset(podRunning)

            By("creating an IP pool with 2 entries from the same pod. First entry corresponds to a previous pod stuck in terminating state")
            pool := generateIPPoolSpec(ipRange, namespace, podName)
            pool.Spec.Allocations = map[string]v1alpha1.IPAllocation{
                "1": {
                    PodRef: fmt.Sprintf("%s/%s", namespace, podName),
                },
                "2": {
                    PodRef: fmt.Sprintf("%s/%s", namespace, podName),
                },
            }
            wbClient = fakewbclient.NewSimpleClientset(pool)

            By("initializing the reconciler")
            var err error
            reconcileLooper, err = NewReconcileLooperWithClient(context.TODO(), kubernetes.NewKubernetesClient(wbClient, k8sClientSet, timeout), timeout)
            Expect(err).NotTo(HaveOccurred())

            By("reconciling and checking that the correct entry is deleted")
            deletedIPAddrs, err := reconcileLooper.ReconcileIPPools(context.TODO())
            Expect(err).NotTo(HaveOccurred())
            Expect(deletedIPAddrs).To(Equal([]net.IP{net.ParseIP("10.10.10.2")}))

            By("verifying the IP pool")
            poolAfterCleanup, err := wbClient.WhereaboutsV1alpha1().IPPools(namespace).Get(context.TODO(), pool.GetName(), metav1.GetOptions{})
            Expect(err).NotTo(HaveOccurred())
            remainingAllocation := map[string]v1alpha1.IPAllocation{
                "1": {
                    PodRef: fmt.Sprintf("%s/%s", namespace, podName),
                },
            }
            Expect(poolAfterCleanup.Spec.Allocations).To(Equal(remainingAllocation))
        })
    })

@dougbtv, @s1061123, @zshi-redhat: please let me know if this makes sense to you.

pliurh commented 3 months ago

@mlguerrero12 I proposed a new PR to address this issue. PTAL.