turbonomic / kubeturbo

140 stars 75 forks source link

[TRB-42713] Fix the pending pod deletion logic in pod move #903

Closed irfanurrehman closed 1 year ago

irfanurrehman commented 1 year ago

Intent

This fixes a code issue with finding and deleting the pending pods created becaue of update of parent replicaset to a dummy scheduler as part of pod moves. Fixes https://jsw.ibm.com/browse/TRB-42713

This also adds few more logs to make the pod move steps more clear

Background

This is a subtle code bug because of how a loop variable is treated in golang. Some explaination below: The issue is in below existing code snippet:

func newPendingInvalidPods(newList, oldList *api.PodList) []*api.Pod {
    var diffList []*api.Pod
    if newList == nil {
        return diffList
    }
    for _, pod1 := range newList.Items {
        found := false
        if oldList != nil {
            for _, pod2 := range oldList.Items {
                if pod1.Name == pod2.Name {
                    // pod1 from newList also exists in oldList
                    found = true
                }
            }
        }
        if found == true {
            continue
        }
        if pod1.Status.Phase == api.PodPending &&
            pod1.Spec.SchedulerName == DummyScheduler {
            diffList = append(diffList, &pod1)
        }
    }

    return diffList
}

in golang below

for _, pod1 := range newList.Items {
...
}

is equivalant to pod1 being a single variable declared for the loop and the values simply update into the same variable in each loop iteration, rather then pod1 being a new variable for each loop iteration.

This means that each element (address) added to the diffList above ( diffList = append(diffList, &pod1)) points to the same loop variable, with each element pointing to the value received in the last loop iteration.

Fix is as updated in this PR

Testing

Tested manually, a sequential execution of 6 pods with the fix, no issues are observed as before

old pods

pod-move-kev-bug-verification-776b4b9648-clkrt             1/1     Running   0          25s     10.130.3.170   worker2.pt-ocp411.turbo.rtp.raleigh.ibm.com   <none>           <none>
pod-move-kev-bug-verification-776b4b9648-dtndx             1/1     Running   0          33s     10.130.3.168   worker2.pt-ocp411.turbo.rtp.raleigh.ibm.com   <none>           <none>
pod-move-kev-bug-verification-776b4b9648-ftzvl             1/1     Running   0          25s     10.130.3.169   worker2.pt-ocp411.turbo.rtp.raleigh.ibm.com   <none>           <none>
pod-move-kev-bug-verification-776b4b9648-kd8z9             1/1     Running   0          25s     10.130.3.173   worker2.pt-ocp411.turbo.rtp.raleigh.ibm.com   <none>           <none>
pod-move-kev-bug-verification-776b4b9648-m2n56             1/1     Running   0          25s     10.130.3.171   worker2.pt-ocp411.turbo.rtp.raleigh.ibm.com   <none>           <none>
pod-move-kev-bug-verification-776b4b9648-xp7wj             1/1     Running   0          25s     10.130.3.172   worker2.pt-ocp411.turbo.rtp.raleigh.ibm.com   <none>           <none>

new pods after the move

pod-move-kev-bug-verification-776b4b9648-clkrt-1es6bt706sovb   1/1     Running   0          2m37s   10.128.3.114   worker1.pt-ocp411.turbo.rtp.raleigh.ibm.com   <none>           <none>
pod-move-kev-bug-verification-776b4b9648-dtndx-1es6bucavdl7f   1/1     Running   0          117s    10.128.3.117   worker1.pt-ocp411.turbo.rtp.raleigh.ibm.com   <none>           <none>
pod-move-kev-bug-verification-776b4b9648-ftzvl-1es6bs1n9pa7g   1/1     Running   0          3m17s   10.128.3.112   worker1.pt-ocp411.turbo.rtp.raleigh.ibm.com   <none>           <none>
pod-move-kev-bug-verification-776b4b9648-kd8z9-1es6bskc4md07   1/1     Running   0          2m57s   10.128.3.113   worker1.pt-ocp411.turbo.rtp.raleigh.ibm.com   <none>           <none>
pod-move-kev-bug-verification-776b4b9648-m2n56-1es6buuvrlhlk   1/1     Running   0          97s     10.128.3.118   worker1.pt-ocp411.turbo.rtp.raleigh.ibm.com   <none>           <none>
pod-move-kev-bug-verification-776b4b9648-xp7wj-1es6btps36b2d   1/1     Running   0          2m17s   10.128.3.115   worker1.pt-ocp411.turbo.rtp.raleigh.ibm.com   <none>           <none>

watch on events

0s          Normal   AddedInterface      pod/pod-move-kev-bug-verification-776b4b9648-ftzvl-1es6bs1n9pa7g   Add eth0 [10.128.3.112/23] from openshift-sdn
0s          Normal   Pulling             pod/pod-move-kev-bug-verification-776b4b9648-ftzvl-1es6bs1n9pa7g   Pulling image "beekman9527/cpumemload:latest"
0s          Normal   Pulled              pod/pod-move-kev-bug-verification-776b4b9648-ftzvl-1es6bs1n9pa7g   Successfully pulled image "beekman9527/cpumemload:latest" in 330.490787ms
0s          Normal   Created             pod/pod-move-kev-bug-verification-776b4b9648-ftzvl-1es6bs1n9pa7g   Created container pod-move-kev-bug-verification-spec
1s          Normal   Started             pod/pod-move-kev-bug-verification-776b4b9648-ftzvl-1es6bs1n9pa7g   Started container pod-move-kev-bug-verification-spec
0s          Normal   Killing             pod/pod-move-kev-bug-verification-776b4b9648-ftzvl                 Stopping container pod-move-kev-bug-verification-spec
0s          Normal   SuccessfulCreate    replicaset/pod-move-kev-bug-verification-776b4b9648                Created pod: pod-move-kev-bug-verification-776b4b9648-6w86j
0s          Normal   SuccessfulDelete    replicaset/pod-move-kev-bug-verification-776b4b9648                Deleted pod: pod-move-kev-bug-verification-776b4b9648-6w86j
0s          Normal   AddedInterface      pod/pod-move-kev-bug-verification-776b4b9648-kd8z9-1es6bskc4md07   Add eth0 [10.128.3.113/23] from openshift-sdn
0s          Normal   Pulling             pod/pod-move-kev-bug-verification-776b4b9648-kd8z9-1es6bskc4md07   Pulling image "beekman9527/cpumemload:latest"
0s          Normal   Pulled              pod/pod-move-kev-bug-verification-776b4b9648-kd8z9-1es6bskc4md07   Successfully pulled image "beekman9527/cpumemload:latest" in 350.326649ms
0s          Normal   Created             pod/pod-move-kev-bug-verification-776b4b9648-kd8z9-1es6bskc4md07   Created container pod-move-kev-bug-verification-spec
0s          Normal   Started             pod/pod-move-kev-bug-verification-776b4b9648-kd8z9-1es6bskc4md07   Started container pod-move-kev-bug-verification-spec
0s          Normal   SuccessfulCreate    replicaset/pod-move-kev-bug-verification-776b4b9648                Created pod: pod-move-kev-bug-verification-776b4b9648-qtv6n
0s          Normal   Killing             pod/pod-move-kev-bug-verification-776b4b9648-kd8z9                 Stopping container pod-move-kev-bug-verification-spec
0s          Normal   SuccessfulDelete    replicaset/pod-move-kev-bug-verification-776b4b9648                Deleted pod: pod-move-kev-bug-verification-776b4b9648-qtv6n
0s          Normal   AddedInterface      pod/pod-move-kev-bug-verification-776b4b9648-clkrt-1es6bt706sovb   Add eth0 [10.128.3.114/23] from openshift-sdn
0s          Normal   Pulling             pod/pod-move-kev-bug-verification-776b4b9648-clkrt-1es6bt706sovb   Pulling image "beekman9527/cpumemload:latest"
0s          Normal   Pulled              pod/pod-move-kev-bug-verification-776b4b9648-clkrt-1es6bt706sovb   Successfully pulled image "beekman9527/cpumemload:latest" in 328.382894ms
0s          Normal   Created             pod/pod-move-kev-bug-verification-776b4b9648-clkrt-1es6bt706sovb   Created container pod-move-kev-bug-verification-spec
0s          Normal   Started             pod/pod-move-kev-bug-verification-776b4b9648-clkrt-1es6bt706sovb   Started container pod-move-kev-bug-verification-spec
0s          Normal   Killing             pod/pod-move-kev-bug-verification-776b4b9648-clkrt                 Stopping container pod-move-kev-bug-verification-spec
0s          Normal   SuccessfulCreate    replicaset/pod-move-kev-bug-verification-776b4b9648                Created pod: pod-move-kev-bug-verification-776b4b9648-bf95m
0s          Normal   SuccessfulDelete    replicaset/pod-move-kev-bug-verification-776b4b9648                (combined from similar events): Deleted pod: pod-move-kev-bug-verification-776b4b9648-bf95m
0s          Normal   AddedInterface      pod/pod-move-kev-bug-verification-776b4b9648-xp7wj-1es6btps36b2d   Add eth0 [10.128.3.115/23] from openshift-sdn
0s          Normal   Pulling             pod/pod-move-kev-bug-verification-776b4b9648-xp7wj-1es6btps36b2d   Pulling image "beekman9527/cpumemload:latest"
0s          Normal   Pulled              pod/pod-move-kev-bug-verification-776b4b9648-xp7wj-1es6btps36b2d   Successfully pulled image "beekman9527/cpumemload:latest" in 348.836579ms
0s          Normal   Created             pod/pod-move-kev-bug-verification-776b4b9648-xp7wj-1es6btps36b2d   Created container pod-move-kev-bug-verification-spec
0s          Normal   Started             pod/pod-move-kev-bug-verification-776b4b9648-xp7wj-1es6btps36b2d   Started container pod-move-kev-bug-verification-spec
0s          Normal   Killing             pod/pod-move-kev-bug-verification-776b4b9648-xp7wj                 Stopping container pod-move-kev-bug-verification-spec
0s          Normal   SuccessfulCreate    replicaset/pod-move-kev-bug-verification-776b4b9648                (combined from similar events): Created pod: pod-move-kev-bug-verification-776b4b9648-tlm7t
0s          Normal   SuccessfulDelete    replicaset/pod-move-kev-bug-verification-776b4b9648                (combined from similar events): Deleted pod: pod-move-kev-bug-verification-776b4b9648-tlm7t
0s          Normal   AddedInterface      pod/pod-move-kev-bug-verification-776b4b9648-dtndx-1es6bucavdl7f   Add eth0 [10.128.3.117/23] from openshift-sdn
0s          Normal   Pulling             pod/pod-move-kev-bug-verification-776b4b9648-dtndx-1es6bucavdl7f   Pulling image "beekman9527/cpumemload:latest"
0s          Normal   Pulled              pod/pod-move-kev-bug-verification-776b4b9648-dtndx-1es6bucavdl7f   Successfully pulled image "beekman9527/cpumemload:latest" in 313.285579ms
0s          Normal   Created             pod/pod-move-kev-bug-verification-776b4b9648-dtndx-1es6bucavdl7f   Created container pod-move-kev-bug-verification-spec
0s          Normal   Started             pod/pod-move-kev-bug-verification-776b4b9648-dtndx-1es6bucavdl7f   Started container pod-move-kev-bug-verification-spec
0s          Normal   Killing             pod/pod-move-kev-bug-verification-776b4b9648-dtndx                 Stopping container pod-move-kev-bug-verification-spec
0s          Normal   SuccessfulCreate    replicaset/pod-move-kev-bug-verification-776b4b9648                (combined from similar events): Created pod: pod-move-kev-bug-verification-776b4b9648-4n5dm
0s          Normal   SuccessfulDelete    replicaset/pod-move-kev-bug-verification-776b4b9648                (combined from similar events): Deleted pod: pod-move-kev-bug-verification-776b4b9648-4n5dm
0s          Normal   AddedInterface      pod/pod-move-kev-bug-verification-776b4b9648-m2n56-1es6buuvrlhlk   Add eth0 [10.128.3.118/23] from openshift-sdn
0s          Normal   Pulling             pod/pod-move-kev-bug-verification-776b4b9648-m2n56-1es6buuvrlhlk   Pulling image "beekman9527/cpumemload:latest"
0s          Normal   Pulled              pod/pod-move-kev-bug-verification-776b4b9648-m2n56-1es6buuvrlhlk   Successfully pulled image "beekman9527/cpumemload:latest" in 475.361934ms
0s          Normal   Created             pod/pod-move-kev-bug-verification-776b4b9648-m2n56-1es6buuvrlhlk   Created container pod-move-kev-bug-verification-spec
0s          Normal   Started             pod/pod-move-kev-bug-verification-776b4b9648-m2n56-1es6buuvrlhlk   Started container pod-move-kev-bug-verification-spec
0s          Normal   Killing             pod/pod-move-kev-bug-verification-776b4b9648-m2n56                 Stopping container pod-move-kev-bug-verification-spec

watch on pods

pod-move-kev-bug-verification-776b4b9648-clkrt                 1/1     Terminating   0          7m31s
pod-move-kev-bug-verification-776b4b9648-clkrt-1es6bt706sovb   1/1     Running       0          12s
pod-move-kev-bug-verification-776b4b9648-dtndx                 1/1     Running       0          7m39s
pod-move-kev-bug-verification-776b4b9648-ftzvl-1es6bs1n9pa7g   1/1     Running       0          52s
pod-move-kev-bug-verification-776b4b9648-kd8z9                 1/1     Terminating   0          7m31s
pod-move-kev-bug-verification-776b4b9648-kd8z9-1es6bskc4md07   1/1     Running       0          32s
pod-move-kev-bug-verification-776b4b9648-m2n56                 1/1     Running       0          7m31s
pod-move-kev-bug-verification-776b4b9648-xp7wj                 1/1     Running       0          7m31s
pod-move-kev-bug-verification-776b4b9648-xp7wj-1es6btps36b2d   0/1     Pending       0          0s
pod-move-kev-bug-verification-776b4b9648-xp7wj-1es6btps36b2d   0/1     ContainerCreating   0          0s
pod-move-kev-bug-verification-776b4b9648-xp7wj-1es6btps36b2d   0/1     ContainerCreating   0          2s
pod-move-kev-bug-verification-776b4b9648-kd8z9                 0/1     Terminating         0          7m41s
pod-move-kev-bug-verification-776b4b9648-kd8z9                 0/1     Terminating         0          7m41s
pod-move-kev-bug-verification-776b4b9648-kd8z9                 0/1     Terminating         0          7m41s
pod-move-kev-bug-verification-776b4b9648-xp7wj-1es6btps36b2d   1/1     Running             0          3s
pod-move-kev-bug-verification-776b4b9648-xp7wj                 1/1     Terminating         0          7m49s
pod-move-kev-bug-verification-776b4b9648-xp7wj-1es6btps36b2d   1/1     Running             0          10s
pod-move-kev-bug-verification-776b4b9648-tlm7t                 0/1     Pending             0          0s
pod-move-kev-bug-verification-776b4b9648-xp7wj-1es6btps36b2d   1/1     Running             0          10s
pod-move-kev-bug-verification-776b4b9648-tlm7t                 0/1     Terminating         0          0s
pod-move-kev-bug-verification-776b4b9648-tlm7t                 0/1     Terminating         0          0s
pod-move-kev-bug-verification-776b4b9648-dtndx-1es6bucavdl7f   0/1     Pending             0          0s
pod-move-kev-bug-verification-776b4b9648-dtndx-1es6bucavdl7f   0/1     ContainerCreating   0          0s
pod-move-kev-bug-verification-776b4b9648-dtndx-1es6bucavdl7f   0/1     ContainerCreating   0          2s
pod-move-kev-bug-verification-776b4b9648-clkrt                 0/1     Terminating         0          8m1s
pod-move-kev-bug-verification-776b4b9648-clkrt                 0/1     Terminating         0          8m1s
pod-move-kev-bug-verification-776b4b9648-clkrt                 0/1     Terminating         0          8m1s
pod-move-kev-bug-verification-776b4b9648-dtndx-1es6bucavdl7f   1/1     Running             0          3s
pod-move-kev-bug-verification-776b4b9648-dtndx                 1/1     Terminating         0          8m17s
pod-move-kev-bug-verification-776b4b9648-dtndx-1es6bucavdl7f   1/1     Running             0          10s
pod-move-kev-bug-verification-776b4b9648-4n5dm                 0/1     Pending             0          0s
pod-move-kev-bug-verification-776b4b9648-dtndx-1es6bucavdl7f   1/1     Running             0          10s
pod-move-kev-bug-verification-776b4b9648-4n5dm                 0/1     Terminating         0          0s
pod-move-kev-bug-verification-776b4b9648-4n5dm                 0/1     Terminating         0          0s
pod-move-kev-bug-verification-776b4b9648-m2n56-1es6buuvrlhlk   0/1     Pending             0          0s
pod-move-kev-bug-verification-776b4b9648-m2n56-1es6buuvrlhlk   0/1     ContainerCreating   0          0s
pod-move-kev-bug-verification-776b4b9648-xp7wj                 0/1     Terminating         0          8m21s
pod-move-kev-bug-verification-776b4b9648-xp7wj                 0/1     Terminating         0          8m21s
pod-move-kev-bug-verification-776b4b9648-xp7wj                 0/1     Terminating         0          8m21s
pod-move-kev-bug-verification-776b4b9648-m2n56-1es6buuvrlhlk   0/1     ContainerCreating   0          2s
pod-move-kev-bug-verification-776b4b9648-m2n56-1es6buuvrlhlk   1/1     Running             0          4s
pod-move-kev-bug-verification-776b4b9648-m2n56                 1/1     Terminating         0          8m29s
pod-move-kev-bug-verification-776b4b9648-nb2ml                 0/1     Pending             0          0s
pod-move-kev-bug-verification-776b4b9648-m2n56-1es6buuvrlhlk   1/1     Running             0          10s
pod-move-kev-bug-verification-776b4b9648-m2n56-1es6buuvrlhlk   1/1     Running             0          10s
pod-move-kev-bug-verification-776b4b9648-nb2ml                 0/1     Terminating         0          0s
pod-move-kev-bug-verification-776b4b9648-nb2ml                 0/1     Terminating         0          0s
pod-move-kev-bug-verification-776b4b9648-dtndx                 0/1     Terminating         0          8m49s
pod-move-kev-bug-verification-776b4b9648-dtndx                 0/1     Terminating         0          8m49s
pod-move-kev-bug-verification-776b4b9648-dtndx                 0/1     Terminating         0          8m49s

pendig pods deletion logic can still find pending pods and deletes them if they are left behind

I0711 11:15:35.659404       1 move_util.go:477] Found pending pod to delete: pod-move-kev-bug-verification-776b4b9648-nb2ml
I0711 11:15:35.659427       1 move_util.go:438] Deleting pending pod pod-move-kev-bug-verification-776b4b9648-nb2ml
I0711 11:15:35.689102       1 common_util.go:92] [retry-1/3] success

Checklist

These are the items that must be done by the developer and by reviewers before the change is ready to merge. Please strikeout any items that are not applicable, but don't delete them

Audience

(@ mention any review/... groups or people that should be aware of this merge request)

kevinwangcn commented 1 year ago

This is difficult to find. Great catch. @irfanurrehman Append a pointer into the diffList slice, that pointer keeps the same but the pointed object has been changed as the iteration goes. As a result, newPendingInvalidPods returns a list with our clone pods and we kill it mistakenly. @irfanurrehman btw, could you share how you check the api-server audit log? I didn't see it in your comments on that JIRA. Thanks

irfanurrehman commented 1 year ago

@irfanurrehman btw, could you share how you check the api-server audit log? I didn't see it in your comments on that JIRA. Thanks

I used this doc from openshift -> https://docs.openshift.com/container-platform/4.10/security/audit-log-view.html

ading1977 commented 1 year ago

Just a generic comment, can we change the log level from 5 to 4? I believe 4 is usually the highest logging level we ask customer to set for debugging purposes.

irfanurrehman commented 1 year ago

Just a generic comment, can we change the log level from 5 to 4? I believe 4 is usually the highest logging level we ask customer to set for debugging purposes.

thanks for the suggestion @ading1977. Updated!