libopenstorage / stork

Stork - Storage Orchestration Runtime for Kubernetes
Apache License 2.0

Stork Health monitor improvements #1834

Closed: diptiranjanpx closed this 2 months ago

diptiranjanpx commented 3 months ago

What type of PR is this? improvement

What this PR does / why we need it:

  1. Filter pods by node instead of listing all pods and looping over them to pick out the ones running on the offline node.
  2. Delete pods in batches of 5.
  3. Re-check the node status before issuing each batch delete.
  4. Index spec.nodeName in the informer cache so the pods running on a given node can be looked up directly (see the sketches after this list).
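
A minimal sketch of the filtering and indexing from items 1 and 4, assuming a controller-runtime manager and a cache-backed client; this illustrates the pattern, it is not the actual stork code:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// registerNodeNameIndex adds a spec.nodeName index to the manager's
// informer cache so pods can be listed per node without scanning all pods.
func registerNodeNameIndex(mgr manager.Manager) error {
	return mgr.GetFieldIndexer().IndexField(context.TODO(), &corev1.Pod{}, "spec.nodeName",
		func(obj client.Object) []string {
			return []string{obj.(*corev1.Pod).Spec.NodeName}
		})
}

// podsOnNode lists only the pods scheduled on the given node, served
// from the indexed informer cache instead of a full pod list.
func podsOnNode(ctx context.Context, c client.Client, node string) ([]corev1.Pod, error) {
	var pods corev1.PodList
	err := c.List(ctx, &pods, client.MatchingFields{"spec.nodeName": node})
	return pods.Items, err
}
```

And a sketch of the batched deletion from items 2 and 3, continuing the example above; driverIsOffline is a hypothetical stand-in for the volume driver status check:

```go
const batchSize = 5

// evictPodsFromOfflineNode deletes the node's pods in batches of
// batchSize, re-checking the driver status before each batch so the
// eviction stops if the driver comes back online.
func evictPodsFromOfflineNode(ctx context.Context, c client.Client, node string,
	driverIsOffline func(node string) bool) error {
	pods, err := podsOnNode(ctx, c, node)
	if err != nil {
		return err
	}
	for start := 0; start < len(pods); start += batchSize {
		// Fresh node-status check per batch (item 3).
		if !driverIsOffline(node) {
			return nil // driver recovered; stop deleting pods
		}
		end := start + batchSize
		if end > len(pods) {
			end = len(pods)
		}
		for i := start; i < end; i++ {
			if err := c.Delete(ctx, &pods[i]); err != nil {
				return err
			}
		}
	}
	return nil
}
```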

Does this PR change a user-facing CRD or CLI?: no

Is a release note needed?: TBD

Issue:
User Impact:
Resolution:

Does this change need to be cherry-picked to a release branch?: no

Test: UTs - 2 of the existing unit tests are failing as they need proper mocking of driver.InspectNode(). Will push the fix.

Manual test :

Pods running on the node:

➜  stork git:(PWX-38468) ✗ kubectl get pods --all-namespaces --field-selector spec.nodeName=ip-10-13-193-24.pwx.purestorage.com
NAMESPACE     NAME                                              READY   STATUS    RESTARTS       AGE
busy-10       busybox-deployment-6dcd7756f9-7q5r6               1/1     Running   0              25m
busy-14       busybox-deployment-6dcd7756f9-ghz9f               1/1     Running   0              25m
busy-16       busybox-deployment-6dcd7756f9-sqdnm               1/1     Running   0              41m
busy-2        busybox-deployment-6dcd7756f9-w7v7r               1/1     Running   0              25m
busy-20       busybox-deployment-654cf6f86f-zf9r2               1/1     Running   0              25m
busy-4        busybox-deployment-6dcd7756f9-zqrm7               1/1     Running   0              25m
busy-6        busybox-deployment-6dcd7756f9-z2nnz               1/1     Running   0              41m
busy-7        busybox-deployment-6dcd7756f9-bkwt9               1/1     Running   0              25m
es            elasticsearch-data-0                              1/1     Running   0              55d
es            elasticsearch-data-2                              1/1     Running   0              21d
kube-system   calico-node-n62mj                                 1/1     Running   32 (76d ago)   155d
kube-system   coredns-69766b66b-2zlsr                           1/1     Running   0              69d
kube-system   kube-proxy-jprg2                                  1/1     Running   1              155d
kube-system   local-px-int-njs5q                                1/1     Running   5 (55m ago)    70d
kube-system   nginx-proxy-ip-10-13-193-24.pwx.purestorage.com   1/1     Running   1 (76d ago)    155d
kube-system   nodelocaldns-7pm2x                                1/1     Running   1              155d
kube-system   portworx-api-kks4x                                2/2     Running   0              70d
kube-system   portworx-kvdb-cjjpx                               1/1     Running   6 (55m ago)    155d
kube-system   px-csi-ext-6c89d5d4f4-7d6fs                       4/4     Running   0              34d
kube-system   stork-996664894-2xdf9                             1/1     Running   0              4m45s
kube-system   stork-scheduler-7d5579975-5zrdc                   1/1     Running   0              34d
mysql         mysql-6d9f9d7947-b9t9l                            1/1     Running   0              68d
mysql5        mysql-6b89f6b6d4-qlxv8                            1/1     Running   0              55d
mysql8        mysql-bb9686cdb-qz7mh                             1/1     Running   0              68d

After stopping px on that node, the driver monitor picked up the offline status and started deleting the pods:

➜  ~ ks logs stork-996664894-6bwnq | grep -i "Deleting pod from node"
time="2024-08-21T11:32:25Z" level=info msg="Deleting Pod from Node ip-10-13-193-24.pwx.purestorage.com due to volume driver status: Offline (STATUS_OFFLINE)" Namespace=busy-16 PodName=busybox-deployment-6dcd7756f9-sqdnm
time="2024-08-21T11:32:25Z" level=info msg="Deleting Pod from Node ip-10-13-193-24.pwx.purestorage.com due to volume driver status: Offline (STATUS_OFFLINE)" Namespace=busy-6 PodName=busybox-deployment-6dcd7756f9-z2nnz
time="2024-08-21T11:32:25Z" level=info msg="Deleting Pod from Node ip-10-13-193-24.pwx.purestorage.com due to volume driver status: Offline (STATUS_OFFLINE)" Namespace=es PodName=elasticsearch-data-0
time="2024-08-21T11:32:25Z" level=info msg="Deleting Pod from Node ip-10-13-193-24.pwx.purestorage.com due to volume driver status: Offline (STATUS_OFFLINE)" Namespace=mysql PodName=mysql-6d9f9d7947-b9t9l
time="2024-08-21T11:32:25Z" level=info msg="Deleting Pod from Node ip-10-13-193-24.pwx.purestorage.com due to volume driver status: Offline (STATUS_OFFLINE)" Namespace=mysql5 PodName=mysql-6b89f6b6d4-qlxv8
time="2024-08-21T11:32:27Z" level=info msg="Deleting Pod from Node ip-10-13-193-24.pwx.purestorage.com due to volume driver status: Offline (STATUS_OFFLINE)" Namespace=busy-10 PodName=busybox-deployment-6dcd7756f9-7q5r6
time="2024-08-21T11:32:27Z" level=info msg="Deleting Pod from Node ip-10-13-193-24.pwx.purestorage.com due to volume driver status: Offline (STATUS_OFFLINE)" Namespace=busy-4 PodName=busybox-deployment-6dcd7756f9-zqrm7
time="2024-08-21T11:32:27Z" level=info msg="Deleting Pod from Node ip-10-13-193-24.pwx.purestorage.com due to volume driver status: Offline (STATUS_OFFLINE)" Namespace=es PodName=elasticsearch-data-2
time="2024-08-21T11:32:27Z" level=info msg="Deleting Pod from Node ip-10-13-193-24.pwx.purestorage.com due to volume driver status: Offline (STATUS_OFFLINE)" Namespace=mysql8 PodName=mysql-bb9686cdb-qz7mh
time="2024-08-21T11:32:27Z" level=info msg="Deleting Pod from Node ip-10-13-193-24.pwx.purestorage.com due to volume driver status: Offline (STATUS_OFFLINE)" Namespace=busy-2 PodName=busybox-deployment-6dcd7756f9-w7v7r
time="2024-08-21T11:32:30Z" level=info msg="Deleting Pod from Node ip-10-13-193-24.pwx.purestorage.com due to volume driver status: Offline (STATUS_OFFLINE)" Namespace=busy-14 PodName=busybox-deployment-6dcd7756f9-ghz9f
time="2024-08-21T11:32:30Z" level=info msg="Deleting Pod from Node ip-10-13-193-24.pwx.purestorage.com due to volume driver status: Offline (STATUS_OFFLINE)" Namespace=busy-20 PodName=busybox-deployment-654cf6f86f-zf9r2
time="2024-08-21T11:32:30Z" level=info msg="Deleting Pod from Node ip-10-13-193-24.pwx.purestorage.com due to volume driver status: Offline (STATUS_OFFLINE)" Namespace=busy-7 PodName=busybox-deployment-6dcd7756f9-bkwt9

After the pods using px volumes were evicted from the offline node:

➜  stork git:(PWX-38468) ✗ kubectl get pods --all-namespaces --field-selector spec.nodeName=ip-10-13-193-24.pwx.purestorage.com
NAMESPACE     NAME                                              READY   STATUS    RESTARTS       AGE
kube-system   calico-node-n62mj                                 1/1     Running   32 (76d ago)   155d
kube-system   coredns-69766b66b-2zlsr                           1/1     Running   0              69d
kube-system   kube-proxy-jprg2                                  1/1     Running   1              155d
kube-system   local-px-int-njs5q                                0/1     Running   6 (2m4s ago)   70d
kube-system   nginx-proxy-ip-10-13-193-24.pwx.purestorage.com   1/1     Running   1 (76d ago)    155d
kube-system   nodelocaldns-7pm2x                                1/1     Running   1              155d
kube-system   portworx-api-kks4x                                1/2     Running   0              70d
kube-system   portworx-kvdb-cjjpx                               0/1     Running   7 (101s ago)   155d
kube-system   stork-996664894-2xdf9                             1/1     Running   0              17m
kube-system   stork-scheduler-7d5579975-5zrdc                   1/1     Running   0              34d

UT:

➜  stork git:(PWX-38468) ✗ /usr/local/go/bin/go test -timeout 1200s -tags unittest,integrationtest  github.com/libopenstorage/stork/pkg/monitor
ok      github.com/libopenstorage/stork/pkg/monitor 663.688s

Removing the testVolumeAttachmentCleanup UT as it is not working:
      1. With the existing mocking, the function can only return a fixed output, but the test needs a dynamic per-node output from InspectNode (see the sketch after this list).
      2. With MockDriver, GetPodsByNode does not return the desired output because the field selector does not work with the mock driver.
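
For illustration, the dynamic per-node output the removed test would need could come from a hand-rolled fake like the one below; the types and method shapes here are hypothetical stand-ins, not stork's real driver interface:

```go
package monitor_test

// fakeNodeStatus and fakeDriver are hypothetical stand-ins used to
// illustrate dynamic mocking; stork's real driver interface differs.
type fakeNodeStatus string

const (
	statusOnline  fakeNodeStatus = "Online"
	statusOffline fakeNodeStatus = "Offline"
)

type fakeDriver struct {
	nodeStatus map[string]fakeNodeStatus // mutable per-node state
}

// InspectNode answers from the per-node state instead of a fixed
// value, so a test can flip a node offline mid-run.
func (d *fakeDriver) InspectNode(nodeID string) (fakeNodeStatus, error) {
	if s, ok := d.nodeStatus[nodeID]; ok {
		return s, nil
	}
	return statusOnline, nil
}

// setOffline marks a node offline between test steps.
func (d *fakeDriver) setOffline(nodeID string) {
	d.nodeStatus[nodeID] = statusOffline
}
```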
codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 76.47059% with 20 lines in your changes missing coverage. Please review.

Project coverage is 34.56%. Comparing base (9f294e6) to head (f05cb83). Report is 1 commit behind head on stork-24.3.1-fb.

Files with missing lines   Patch %   Lines
pkg/monitor/monitor.go     76.47%    15 Missing and 5 partials :warning:
Additional details and impacted files

```diff
@@               Coverage Diff                @@
##           stork-24.3.1-fb    #1834      +/-   ##
===================================================
+ Coverage          34.45%   34.56%   +0.10%
===================================================
  Files                 55       55
  Lines              12706    12765      +59
===================================================
+ Hits                4378     4412      +34
- Misses              7933     7958      +25
  Partials             395      395
```


diptiranjanpx commented 2 months ago

Integration test- https://jenkins.pwx.dev.purestorage.com/job/Users/job/strivedi/job/dup-healthmonitor/2/