nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Cordon 4 A100SXM4 Nodes with the least current usage #702

Open joachimweyl opened 1 month ago

joachimweyl commented 1 month ago

Motivation

We need more and more bare metal (BM) GPU nodes for RH, and we are currently well below capacity for OpenShift usage. If we cordon nodes, they become easy to either move to BM or uncordon back into OpenShift. By cordoning them we force usage onto the other 4 nodes, and that way we have 4 nodes that are more flexible.

Completion Criteria

4 nodes cordoned and ready to either be moved to BM or uncordoned for OpenShift

Description

Completion dates

Desired - 2024-09-05
Required - TBD

joachimweyl commented 3 weeks ago

@jtriley How much effort would it take to kick this off? Do you have a timeframe planned for it?

jtriley commented 2 weeks ago

@joachimweyl looking into this now. Currently scanning the nodes for GPU workloads using:

$ oc get nodes -o name -l 'nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB' \
    | xargs -I {} -t oc debug --as=system:admin {} -- \
      chroot /host bash -c 'crictl exec $(crictl ps --name nvidia-driver-ctr -q) nvidia-smi --query-compute-apps=pid --format=csv,noheader'
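
Since the task is to cordon the nodes with the least current usage, a similar per-node scan can report GPU utilization directly. This is only a sketch under the same assumptions as the command above (driver container named nvidia-driver-ctr, admin access via oc debug), not a command from the thread:

$ oc get nodes -o name -l 'nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB' \
    | xargs -I {} -t oc debug --as=system:admin {} -- \
      chroot /host bash -c 'crictl exec $(crictl ps --name nvidia-driver-ctr -q) nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader'
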
joachimweyl commented 2 weeks ago

How did this go?

jtriley commented 1 week ago

@joachimweyl I've cordoned the following 4x A100-SXM4 nodes in the prod cluster:

$ oc get nodes -l 'nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB' | grep -i schedulingdisabled
wrk-101   Ready,SchedulingDisabled   worker   173d   v1.28.11+add48d0
wrk-94    Ready,SchedulingDisabled   worker   189d   v1.28.11+add48d0
wrk-95    Ready,SchedulingDisabled   worker   189d   v1.28.11+add48d0
wrk-96    Ready,SchedulingDisabled   worker   189d   v1.28.11+add48d0
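
(For reference: cordoning only marks a node unschedulable; nothing is evicted. These were presumably cordoned with something along the lines of the following, though the exact invocation isn't shown here:)

$ for node in wrk-94 wrk-95 wrk-96 wrk-101; do oc adm cordon "$node"; done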

All of these nodes appear to have some user workloads active; however, I've confirmed that wrk-101 and wrk-96 have no rhods-notebook pods running, and according to nvidia-smi they also have no active GPU workloads.
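
(A sketch for double-checking that, not from the thread: a standard field selector lists any rhods-notebooks pods pinned to a given node, e.g.:)

$ oc get pods -n rhods-notebooks --field-selector spec.nodeName=wrk-101
$ oc get pods -n rhods-notebooks --field-selector spec.nodeName=wrk-96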

In order to remove these from the prod cluster for ESI, we'll need to do one of the following:

  1. Wait for user workload pods to terminate on their own (a quick check for this is sketched after this list)
  2. Notify folks and terminate the pods manually
  3. Drain these nodes during a future maintenance when all worker nodes would have to be rebooted anyway (e.g. udev rule update, upgrade, etc.)
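
(For option 1, a rough way to watch the nodes empty out. This is a sketch, not from the thread; the grep pattern assumes the infrastructure namespaces seen in the drain output below and may need adjusting:)

$ for node in wrk-94 wrk-95 wrk-96 wrk-101; do
    echo "== $node =="
    # list pods scheduled on the node, then filter out infra namespaces
    oc get pods -A --field-selector spec.nodeName=$node --no-headers \
      | grep -vE '^(openshift-|nvidia-gpu-operator|ipmi-exporter|seccomp-profile-installer)'
  done
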
jtriley commented 1 week ago

The pods currently running on those 4x hosts:
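
(The output below looks like a server-side dry-run drain; presumably something along the lines of the following, though the exact invocation isn't shown in the thread:)

$ for node in wrk-94 wrk-95 wrk-96 wrk-101; do oc adm drain "$node" --dry-run=server; done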

node/wrk-94 cordoned (server dry run)
error: unable to drain node "wrk-94" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-f8qn4, nvidia-gpu-operator/gpu-feature-discovery-wf4zs, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-58c9q, nvidia-gpu-operator/nvidia-dcgm-exporter-ptdxq, nvidia-gpu-operator/nvidia-dcgm-k99h5, nvidia-gpu-operator/nvidia-device-plugin-daemonset-gp6zx, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-4wg7l, nvidia-gpu-operator/nvidia-mig-manager-flr2w, nvidia-gpu-operator/nvidia-node-status-exporter-hfhvx, nvidia-gpu-operator/nvidia-operator-validator-4xzv5, openshift-cluster-node-tuning-operator/tuned-tkw8j, openshift-dns/dns-default-z7rq4, openshift-dns/node-resolver-gvxzq, openshift-image-registry/node-ca-zznc6, openshift-ingress-canary/ingress-canary-tspxc, openshift-logging/collector-hhlxt, openshift-machine-config-operator/machine-config-daemon-kfw4q, openshift-monitoring/node-exporter-h2dpm, openshift-multus/multus-68c4p, openshift-multus/multus-additional-cni-plugins-d942k, openshift-multus/network-metrics-daemon-6hkzk, openshift-network-diagnostics/network-check-target-x7lcj, openshift-nfd/nfd-worker-s57pg, openshift-nmstate/nmstate-handler-q7vvf, openshift-sdn/sdn-t7g7x, openshift-storage/csi-rbdplugin-xp6wm, seccomp-profile-installer/seccomp-profile-installer-m64kq, cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-user-workload-monitoring/thanos-ruler-user-workload-1, rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0, rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0, rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0, rhods-notebooks/jupyter-nb-hjgong-40bu-2eedu-0, rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0, rhods-notebooks/jupyter-nb-kimc-40bu-2eedu-0, rhods-notebooks/jupyter-nb-kthanasi-40bu-2eedu-0, rhods-notebooks/jupyter-nb-lferris1-40bu-2eedu-0, rhods-notebooks/jupyter-nb-lgp116-40bu-2eedu-0, rhods-notebooks/jupyter-nb-linhb-40bu-2eedu-0, rhods-notebooks/jupyter-nb-mikelel-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tsaij-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zhaoxm-40bu-2eedu-0, rustproject-3304e3/working-section-0], continuing command...
There are pending nodes to be drained:
 wrk-94
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-f8qn4, nvidia-gpu-operator/gpu-feature-discovery-wf4zs, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-58c9q, nvidia-gpu-operator/nvidia-dcgm-exporter-ptdxq, nvidia-gpu-operator/nvidia-dcgm-k99h5, nvidia-gpu-operator/nvidia-device-plugin-daemonset-gp6zx, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-4wg7l, nvidia-gpu-operator/nvidia-mig-manager-flr2w, nvidia-gpu-operator/nvidia-node-status-exporter-hfhvx, nvidia-gpu-operator/nvidia-operator-validator-4xzv5, openshift-cluster-node-tuning-operator/tuned-tkw8j, openshift-dns/dns-default-z7rq4, openshift-dns/node-resolver-gvxzq, openshift-image-registry/node-ca-zznc6, openshift-ingress-canary/ingress-canary-tspxc, openshift-logging/collector-hhlxt, openshift-machine-config-operator/machine-config-daemon-kfw4q, openshift-monitoring/node-exporter-h2dpm, openshift-multus/multus-68c4p, openshift-multus/multus-additional-cni-plugins-d942k, openshift-multus/network-metrics-daemon-6hkzk, openshift-network-diagnostics/network-check-target-x7lcj, openshift-nfd/nfd-worker-s57pg, openshift-nmstate/nmstate-handler-q7vvf, openshift-sdn/sdn-t7g7x, openshift-storage/csi-rbdplugin-xp6wm, seccomp-profile-installer/seccomp-profile-installer-m64kq
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-user-workload-monitoring/thanos-ruler-user-workload-1, rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0, rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0, rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0, rhods-notebooks/jupyter-nb-hjgong-40bu-2eedu-0, rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0, rhods-notebooks/jupyter-nb-kimc-40bu-2eedu-0, rhods-notebooks/jupyter-nb-kthanasi-40bu-2eedu-0, rhods-notebooks/jupyter-nb-lferris1-40bu-2eedu-0, rhods-notebooks/jupyter-nb-lgp116-40bu-2eedu-0, rhods-notebooks/jupyter-nb-linhb-40bu-2eedu-0, rhods-notebooks/jupyter-nb-mikelel-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tsaij-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zhaoxm-40bu-2eedu-0, rustproject-3304e3/working-section-0

node/wrk-95 cordoned (server dry run)
error: unable to drain node "wrk-95" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-rc7wz, nvidia-gpu-operator/gpu-feature-discovery-72bpf, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lgtv8, nvidia-gpu-operator/nvidia-dcgm-824mf, nvidia-gpu-operator/nvidia-dcgm-exporter-d8xv9, nvidia-gpu-operator/nvidia-device-plugin-daemonset-k9vj5, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-m6dzg, nvidia-gpu-operator/nvidia-mig-manager-qz9lf, nvidia-gpu-operator/nvidia-node-status-exporter-nvrvj, nvidia-gpu-operator/nvidia-operator-validator-jqrwq, openshift-cluster-node-tuning-operator/tuned-rvjkb, openshift-dns/dns-default-s742x, openshift-dns/node-resolver-b4zb4, openshift-image-registry/node-ca-v76q5, openshift-ingress-canary/ingress-canary-k92pz, openshift-logging/collector-xrvxj, openshift-machine-config-operator/machine-config-daemon-d26xv, openshift-monitoring/node-exporter-7htxs, openshift-multus/multus-additional-cni-plugins-vsg9t, openshift-multus/multus-bwtlz, openshift-multus/network-metrics-daemon-fhvn4, openshift-network-diagnostics/network-check-target-26thj, openshift-nfd/nfd-worker-m2n5l, openshift-nmstate/nmstate-handler-dh4k4, openshift-sdn/sdn-mfhm8, openshift-storage/csi-rbdplugin-f74x9, seccomp-profile-installer/seccomp-profile-installer-sjzm9, cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-monitoring/prometheus-k8s-0, openshift-user-workload-monitoring/prometheus-user-workload-0, openshift-user-workload-monitoring/thanos-ruler-user-workload-0, rhods-notebooks/jupyter-nb-aalaman-40bu-2eedu-0, rhods-notebooks/jupyter-nb-aryasur-40bu-2eedu-0, rhods-notebooks/jupyter-nb-celiag-40bu-2eedu-0, rhods-notebooks/jupyter-nb-ekim6535-40bu-2eedu-0, rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0, rhods-notebooks/jupyter-nb-mmeng-40bu-2eedu-0, rhods-notebooks/jupyter-nb-nickt03-40bu-2eedu-0, rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0, rhods-notebooks/jupyter-nb-sliu10-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tessat-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0, rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0, sail-24887a/redis-558f589fbb-c77ws], continuing command...
There are pending nodes to be drained:
 wrk-95
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-rc7wz, nvidia-gpu-operator/gpu-feature-discovery-72bpf, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lgtv8, nvidia-gpu-operator/nvidia-dcgm-824mf, nvidia-gpu-operator/nvidia-dcgm-exporter-d8xv9, nvidia-gpu-operator/nvidia-device-plugin-daemonset-k9vj5, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-m6dzg, nvidia-gpu-operator/nvidia-mig-manager-qz9lf, nvidia-gpu-operator/nvidia-node-status-exporter-nvrvj, nvidia-gpu-operator/nvidia-operator-validator-jqrwq, openshift-cluster-node-tuning-operator/tuned-rvjkb, openshift-dns/dns-default-s742x, openshift-dns/node-resolver-b4zb4, openshift-image-registry/node-ca-v76q5, openshift-ingress-canary/ingress-canary-k92pz, openshift-logging/collector-xrvxj, openshift-machine-config-operator/machine-config-daemon-d26xv, openshift-monitoring/node-exporter-7htxs, openshift-multus/multus-additional-cni-plugins-vsg9t, openshift-multus/multus-bwtlz, openshift-multus/network-metrics-daemon-fhvn4, openshift-network-diagnostics/network-check-target-26thj, openshift-nfd/nfd-worker-m2n5l, openshift-nmstate/nmstate-handler-dh4k4, openshift-sdn/sdn-mfhm8, openshift-storage/csi-rbdplugin-f74x9, seccomp-profile-installer/seccomp-profile-installer-sjzm9
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-monitoring/prometheus-k8s-0, openshift-user-workload-monitoring/prometheus-user-workload-0, openshift-user-workload-monitoring/thanos-ruler-user-workload-0, rhods-notebooks/jupyter-nb-aalaman-40bu-2eedu-0, rhods-notebooks/jupyter-nb-aryasur-40bu-2eedu-0, rhods-notebooks/jupyter-nb-celiag-40bu-2eedu-0, rhods-notebooks/jupyter-nb-ekim6535-40bu-2eedu-0, rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0, rhods-notebooks/jupyter-nb-mmeng-40bu-2eedu-0, rhods-notebooks/jupyter-nb-nickt03-40bu-2eedu-0, rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0, rhods-notebooks/jupyter-nb-sliu10-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tessat-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0, rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0, sail-24887a/redis-558f589fbb-c77ws

node/wrk-96 already cordoned (server dry run)
error: unable to drain node "wrk-96" due to error:[cannot delete Pods with local storage (use --delete-emptydir-data to override): ai-telemetry-cbca60/keycloak-0, koku-metrics-operator/curatordb-cluster-repo-host-0, ope-rhods-testing-1fef2f/danni-test-2-0, ope-rhods-testing-1fef2f/ja-ucsls-0, ope-rhods-testing-1fef2f/meera-utc-0, ope-rhods-testing-1fef2f/test-image-0, ope-rhods-testing-1fef2f/vnc-0, openshift-monitoring/alertmanager-main-0, cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-kvszc, nvidia-gpu-operator/gpu-feature-discovery-l5mvr, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lb6gb, nvidia-gpu-operator/nvidia-dcgm-exporter-jwqdd, nvidia-gpu-operator/nvidia-dcgm-ljf2v, nvidia-gpu-operator/nvidia-device-plugin-daemonset-nzfvq, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-jqrsg, nvidia-gpu-operator/nvidia-mig-manager-wfxlc, nvidia-gpu-operator/nvidia-node-status-exporter-x874h, nvidia-gpu-operator/nvidia-operator-validator-vg9p4, openshift-cluster-node-tuning-operator/tuned-4whll, openshift-dns/dns-default-fj764, openshift-dns/node-resolver-wnhwt, openshift-image-registry/node-ca-wnf8r, openshift-ingress-canary/ingress-canary-wg7k2, openshift-logging/collector-qsdgg, openshift-machine-config-operator/machine-config-daemon-klc7h, openshift-monitoring/node-exporter-wcrnf, openshift-multus/multus-6b9m4, openshift-multus/multus-additional-cni-plugins-wqd74, openshift-multus/network-metrics-daemon-47xxj, openshift-network-diagnostics/network-check-target-bknh8, openshift-nfd/nfd-worker-d6kr4, openshift-nmstate/nmstate-handler-fjclc, openshift-sdn/sdn-67j55, openshift-storage/csi-rbdplugin-fwp2g, seccomp-profile-installer/seccomp-profile-installer-d4qhj], continuing command...
There are pending nodes to be drained:
 wrk-96
cannot delete Pods with local storage (use --delete-emptydir-data to override): ai-telemetry-cbca60/keycloak-0, koku-metrics-operator/curatordb-cluster-repo-host-0, ope-rhods-testing-1fef2f/danni-test-2-0, ope-rhods-testing-1fef2f/ja-ucsls-0, ope-rhods-testing-1fef2f/meera-utc-0, ope-rhods-testing-1fef2f/test-image-0, ope-rhods-testing-1fef2f/vnc-0, openshift-monitoring/alertmanager-main-0
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-kvszc, nvidia-gpu-operator/gpu-feature-discovery-l5mvr, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lb6gb, nvidia-gpu-operator/nvidia-dcgm-exporter-jwqdd, nvidia-gpu-operator/nvidia-dcgm-ljf2v, nvidia-gpu-operator/nvidia-device-plugin-daemonset-nzfvq, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-jqrsg, nvidia-gpu-operator/nvidia-mig-manager-wfxlc, nvidia-gpu-operator/nvidia-node-status-exporter-x874h, nvidia-gpu-operator/nvidia-operator-validator-vg9p4, openshift-cluster-node-tuning-operator/tuned-4whll, openshift-dns/dns-default-fj764, openshift-dns/node-resolver-wnhwt, openshift-image-registry/node-ca-wnf8r, openshift-ingress-canary/ingress-canary-wg7k2, openshift-logging/collector-qsdgg, openshift-machine-config-operator/machine-config-daemon-klc7h, openshift-monitoring/node-exporter-wcrnf, openshift-multus/multus-6b9m4, openshift-multus/multus-additional-cni-plugins-wqd74, openshift-multus/network-metrics-daemon-47xxj, openshift-network-diagnostics/network-check-target-bknh8, openshift-nfd/nfd-worker-d6kr4, openshift-nmstate/nmstate-handler-fjclc, openshift-sdn/sdn-67j55, openshift-storage/csi-rbdplugin-fwp2g, seccomp-profile-installer/seccomp-profile-installer-d4qhj

node/wrk-101 already cordoned (server dry run)
error: unable to drain node "wrk-101" due to error:[cannot delete Pods with local storage (use --delete-emptydir-data to override): curator-system/curatordb-cluster-00-lgb6-0, openshift-storage/object-backing-store-noobaa-pod-552cdf65, sail-24887a/comets-mongo-6c79b94448-fcg5h, cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-27n8k, nvidia-gpu-operator/gpu-feature-discovery-wktn8, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-p5b5w, nvidia-gpu-operator/nvidia-dcgm-exporter-dfhg4, nvidia-gpu-operator/nvidia-dcgm-tz7cm, nvidia-gpu-operator/nvidia-device-plugin-daemonset-mg45p, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-cgbwj, nvidia-gpu-operator/nvidia-mig-manager-8pcxq, nvidia-gpu-operator/nvidia-node-status-exporter-g26qq, nvidia-gpu-operator/nvidia-operator-validator-4bdzg, openshift-cluster-node-tuning-operator/tuned-qtk2z, openshift-dns/dns-default-4rpvn, openshift-dns/node-resolver-rrxvb, openshift-image-registry/node-ca-4n5jl, openshift-ingress-canary/ingress-canary-kqgxv, openshift-logging/collector-wzhxl, openshift-machine-config-operator/machine-config-daemon-clm97, openshift-monitoring/node-exporter-zn65c, openshift-multus/multus-additional-cni-plugins-x4wjf, openshift-multus/multus-npzsd, openshift-multus/network-metrics-daemon-mc46n, openshift-network-diagnostics/network-check-target-z2dhn, openshift-nfd/nfd-worker-smjbl, openshift-nmstate/nmstate-handler-zrfp9, openshift-sdn/sdn-fh6cf, openshift-storage/csi-rbdplugin-9lfjr, seccomp-profile-installer/seccomp-profile-installer-zn4wr], continuing command...
There are pending nodes to be drained:
 wrk-101
cannot delete Pods with local storage (use --delete-emptydir-data to override): curator-system/curatordb-cluster-00-lgb6-0, openshift-storage/object-backing-store-noobaa-pod-552cdf65, sail-24887a/comets-mongo-6c79b94448-fcg5h
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-27n8k, nvidia-gpu-operator/gpu-feature-discovery-wktn8, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-p5b5w, nvidia-gpu-operator/nvidia-dcgm-exporter-dfhg4, nvidia-gpu-operator/nvidia-dcgm-tz7cm, nvidia-gpu-operator/nvidia-device-plugin-daemonset-mg45p, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-cgbwj, nvidia-gpu-operator/nvidia-mig-manager-8pcxq, nvidia-gpu-operator/nvidia-node-status-exporter-g26qq, nvidia-gpu-operator/nvidia-operator-validator-4bdzg, openshift-cluster-node-tuning-operator/tuned-qtk2z, openshift-dns/dns-default-4rpvn, openshift-dns/node-resolver-rrxvb, openshift-image-registry/node-ca-4n5jl, openshift-ingress-canary/ingress-canary-kqgxv, openshift-logging/collector-wzhxl, openshift-machine-config-operator/machine-config-daemon-clm97, openshift-monitoring/node-exporter-zn65c, openshift-multus/multus-additional-cni-plugins-x4wjf, openshift-multus/multus-npzsd, openshift-multus/network-metrics-daemon-mc46n, openshift-network-diagnostics/network-check-target-z2dhn, openshift-nfd/nfd-worker-smjbl, openshift-nmstate/nmstate-handler-zrfp9, openshift-sdn/sdn-fh6cf, openshift-storage/csi-rbdplugin-9lfjr, seccomp-profile-installer/seccomp-profile-installer-zn4wr
joachimweyl commented 1 week ago

For now, RH has the hardware they need, so we can wait and see if the pods wind down on their own. If the nodes have not emptied by our next maintenance, let's drain them then. If we need to provide GPUs to RH sooner than the next maintenance, @Milstein can you create (or copy-paste, if you already have one) a short communication letting these users know their projects will be restarted on new nodes?

joachimweyl commented 6 days ago

@jtriley have any of the nodes cleared of pods?

jtriley commented 5 days ago

Still looking through the list of pods, but it looks like 2 of the 4 have cleared; the other 2 still have rhods notebooks running:

node/wrk-94 already cordoned
error: unable to drain node "wrk-94" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-f8qn4, nvidia-gpu-operator/gpu-feature-discovery-wf4zs, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-58c9q, nvidia-gpu-operator/nvidia-dcgm-exporter-ptdxq, nvidia-gpu-operator/nvidia-dcgm-k99h5, nvidia-gpu-operator/nvidia-device-plugin-daemonset-gp6zx, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-4wg7l, nvidia-gpu-operator/nvidia-mig-manager-flr2w, nvidia-gpu-operator/nvidia-node-status-exporter-hfhvx, nvidia-gpu-operator/nvidia-operator-validator-4xzv5, openshift-cluster-node-tuning-operator/tuned-tkw8j, openshift-dns/dns-default-z7rq4, openshift-dns/node-resolver-gvxzq, openshift-image-registry/node-ca-zznc6, openshift-ingress-canary/ingress-canary-tspxc, openshift-logging/collector-hhlxt, openshift-machine-config-operator/machine-config-daemon-kfw4q, openshift-monitoring/node-exporter-h2dpm, openshift-multus/multus-68c4p, openshift-multus/multus-additional-cni-plugins-d942k, openshift-multus/network-metrics-daemon-6hkzk, openshift-network-diagnostics/network-check-target-x7lcj, openshift-nfd/nfd-worker-s57pg, openshift-nmstate/nmstate-handler-zmnp5, openshift-sdn/sdn-t7g7x, openshift-storage/csi-rbdplugin-xp6wm, seccomp-profile-installer/seccomp-profile-installer-m64kq, cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-user-workload-monitoring/thanos-ruler-user-workload-1, rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0, rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0, rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0, rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0], continuing command...
There are pending nodes to be drained:
 wrk-94
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-f8qn4, nvidia-gpu-operator/gpu-feature-discovery-wf4zs, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-58c9q, nvidia-gpu-operator/nvidia-dcgm-exporter-ptdxq, nvidia-gpu-operator/nvidia-dcgm-k99h5, nvidia-gpu-operator/nvidia-device-plugin-daemonset-gp6zx, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-4wg7l, nvidia-gpu-operator/nvidia-mig-manager-flr2w, nvidia-gpu-operator/nvidia-node-status-exporter-hfhvx, nvidia-gpu-operator/nvidia-operator-validator-4xzv5, openshift-cluster-node-tuning-operator/tuned-tkw8j, openshift-dns/dns-default-z7rq4, openshift-dns/node-resolver-gvxzq, openshift-image-registry/node-ca-zznc6, openshift-ingress-canary/ingress-canary-tspxc, openshift-logging/collector-hhlxt, openshift-machine-config-operator/machine-config-daemon-kfw4q, openshift-monitoring/node-exporter-h2dpm, openshift-multus/multus-68c4p, openshift-multus/multus-additional-cni-plugins-d942k, openshift-multus/network-metrics-daemon-6hkzk, openshift-network-diagnostics/network-check-target-x7lcj, openshift-nfd/nfd-worker-s57pg, openshift-nmstate/nmstate-handler-zmnp5, openshift-sdn/sdn-t7g7x, openshift-storage/csi-rbdplugin-xp6wm, seccomp-profile-installer/seccomp-profile-installer-m64kq
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-user-workload-monitoring/thanos-ruler-user-workload-1, rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0, rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0, rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0, rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0

node/wrk-95 already cordoned
error: unable to drain node "wrk-95" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-rc7wz, nvidia-gpu-operator/gpu-feature-discovery-72bpf, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lgtv8, nvidia-gpu-operator/nvidia-dcgm-824mf, nvidia-gpu-operator/nvidia-dcgm-exporter-d8xv9, nvidia-gpu-operator/nvidia-device-plugin-daemonset-k9vj5, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-m6dzg, nvidia-gpu-operator/nvidia-mig-manager-qz9lf, nvidia-gpu-operator/nvidia-node-status-exporter-nvrvj, nvidia-gpu-operator/nvidia-operator-validator-jqrwq, openshift-cluster-node-tuning-operator/tuned-rvjkb, openshift-dns/dns-default-s742x, openshift-dns/node-resolver-b4zb4, openshift-image-registry/node-ca-v76q5, openshift-ingress-canary/ingress-canary-k92pz, openshift-logging/collector-xrvxj, openshift-machine-config-operator/machine-config-daemon-d26xv, openshift-monitoring/node-exporter-7htxs, openshift-multus/multus-additional-cni-plugins-vsg9t, openshift-multus/multus-bwtlz, openshift-multus/network-metrics-daemon-fhvn4, openshift-network-diagnostics/network-check-target-26thj, openshift-nfd/nfd-worker-m2n5l, openshift-nmstate/nmstate-handler-kxdtq, openshift-sdn/sdn-mfhm8, openshift-storage/csi-rbdplugin-f74x9, seccomp-profile-installer/seccomp-profile-installer-sjzm9, cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-monitoring/prometheus-k8s-0, openshift-user-workload-monitoring/prometheus-user-workload-0, openshift-user-workload-monitoring/thanos-ruler-user-workload-0, rhods-notebooks/jupyter-nb-ekim6535-40bu-2eedu-0, rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0, rhods-notebooks/jupyter-nb-nickt03-40bu-2eedu-0, rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0, rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0, sail-24887a/redis-558f589fbb-c77ws], continuing command...
There are pending nodes to be drained:
 wrk-95
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-rc7wz, nvidia-gpu-operator/gpu-feature-discovery-72bpf, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lgtv8, nvidia-gpu-operator/nvidia-dcgm-824mf, nvidia-gpu-operator/nvidia-dcgm-exporter-d8xv9, nvidia-gpu-operator/nvidia-device-plugin-daemonset-k9vj5, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-m6dzg, nvidia-gpu-operator/nvidia-mig-manager-qz9lf, nvidia-gpu-operator/nvidia-node-status-exporter-nvrvj, nvidia-gpu-operator/nvidia-operator-validator-jqrwq, openshift-cluster-node-tuning-operator/tuned-rvjkb, openshift-dns/dns-default-s742x, openshift-dns/node-resolver-b4zb4, openshift-image-registry/node-ca-v76q5, openshift-ingress-canary/ingress-canary-k92pz, openshift-logging/collector-xrvxj, openshift-machine-config-operator/machine-config-daemon-d26xv, openshift-monitoring/node-exporter-7htxs, openshift-multus/multus-additional-cni-plugins-vsg9t, openshift-multus/multus-bwtlz, openshift-multus/network-metrics-daemon-fhvn4, openshift-network-diagnostics/network-check-target-26thj, openshift-nfd/nfd-worker-m2n5l, openshift-nmstate/nmstate-handler-kxdtq, openshift-sdn/sdn-mfhm8, openshift-storage/csi-rbdplugin-f74x9, seccomp-profile-installer/seccomp-profile-installer-sjzm9
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-monitoring/prometheus-k8s-0, openshift-user-workload-monitoring/prometheus-user-workload-0, openshift-user-workload-monitoring/thanos-ruler-user-workload-0, rhods-notebooks/jupyter-nb-ekim6535-40bu-2eedu-0, rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0, rhods-notebooks/jupyter-nb-nickt03-40bu-2eedu-0, rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0, rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0, sail-24887a/redis-558f589fbb-c77ws

node/wrk-96 already cordoned
error: unable to drain node "wrk-96" due to error:[cannot delete Pods with local storage (use --delete-emptydir-data to override): ai-telemetry-cbca60/keycloak-0, koku-metrics-operator/curatordb-cluster-repo-host-0, ope-rhods-testing-1fef2f/ja-ucsls-0, ope-rhods-testing-1fef2f/meera-utc-0, ope-rhods-testing-1fef2f/test-image-0, ope-rhods-testing-1fef2f/vnc-0, openshift-monitoring/alertmanager-main-0, cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-kvszc, nvidia-gpu-operator/gpu-feature-discovery-l5mvr, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lb6gb, nvidia-gpu-operator/nvidia-dcgm-exporter-jwqdd, nvidia-gpu-operator/nvidia-dcgm-ljf2v, nvidia-gpu-operator/nvidia-device-plugin-daemonset-nzfvq, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-jqrsg, nvidia-gpu-operator/nvidia-mig-manager-wfxlc, nvidia-gpu-operator/nvidia-node-status-exporter-x874h, nvidia-gpu-operator/nvidia-operator-validator-vg9p4, openshift-cluster-node-tuning-operator/tuned-4whll, openshift-dns/dns-default-fj764, openshift-dns/node-resolver-wnhwt, openshift-image-registry/node-ca-wnf8r, openshift-ingress-canary/ingress-canary-wg7k2, openshift-logging/collector-qsdgg, openshift-machine-config-operator/machine-config-daemon-klc7h, openshift-monitoring/node-exporter-wcrnf, openshift-multus/multus-6b9m4, openshift-multus/multus-additional-cni-plugins-wqd74, openshift-multus/network-metrics-daemon-47xxj, openshift-network-diagnostics/network-check-target-bknh8, openshift-nfd/nfd-worker-d6kr4, openshift-nmstate/nmstate-handler-x6mwm, openshift-sdn/sdn-67j55, openshift-storage/csi-rbdplugin-fwp2g, seccomp-profile-installer/seccomp-profile-installer-d4qhj], continuing command...
There are pending nodes to be drained:
 wrk-96
cannot delete Pods with local storage (use --delete-emptydir-data to override): ai-telemetry-cbca60/keycloak-0, koku-metrics-operator/curatordb-cluster-repo-host-0, ope-rhods-testing-1fef2f/ja-ucsls-0, ope-rhods-testing-1fef2f/meera-utc-0, ope-rhods-testing-1fef2f/test-image-0, ope-rhods-testing-1fef2f/vnc-0, openshift-monitoring/alertmanager-main-0
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-kvszc, nvidia-gpu-operator/gpu-feature-discovery-l5mvr, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lb6gb, nvidia-gpu-operator/nvidia-dcgm-exporter-jwqdd, nvidia-gpu-operator/nvidia-dcgm-ljf2v, nvidia-gpu-operator/nvidia-device-plugin-daemonset-nzfvq, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-jqrsg, nvidia-gpu-operator/nvidia-mig-manager-wfxlc, nvidia-gpu-operator/nvidia-node-status-exporter-x874h, nvidia-gpu-operator/nvidia-operator-validator-vg9p4, openshift-cluster-node-tuning-operator/tuned-4whll, openshift-dns/dns-default-fj764, openshift-dns/node-resolver-wnhwt, openshift-image-registry/node-ca-wnf8r, openshift-ingress-canary/ingress-canary-wg7k2, openshift-logging/collector-qsdgg, openshift-machine-config-operator/machine-config-daemon-klc7h, openshift-monitoring/node-exporter-wcrnf, openshift-multus/multus-6b9m4, openshift-multus/multus-additional-cni-plugins-wqd74, openshift-multus/network-metrics-daemon-47xxj, openshift-network-diagnostics/network-check-target-bknh8, openshift-nfd/nfd-worker-d6kr4, openshift-nmstate/nmstate-handler-x6mwm, openshift-sdn/sdn-67j55, openshift-storage/csi-rbdplugin-fwp2g, seccomp-profile-installer/seccomp-profile-installer-d4qhj

node/wrk-101 already cordoned
error: unable to drain node "wrk-101" due to error:[cannot delete Pods with local storage (use --delete-emptydir-data to override): curator-system/curatordb-cluster-00-lgb6-0, openshift-storage/object-backing-store-noobaa-pod-552cdf65, sail-24887a/comets-mongo-6c79b94448-fcg5h, cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-27n8k, nvidia-gpu-operator/gpu-feature-discovery-wktn8, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-p5b5w, nvidia-gpu-operator/nvidia-dcgm-exporter-dfhg4, nvidia-gpu-operator/nvidia-dcgm-tz7cm, nvidia-gpu-operator/nvidia-device-plugin-daemonset-mg45p, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-cgbwj, nvidia-gpu-operator/nvidia-mig-manager-8pcxq, nvidia-gpu-operator/nvidia-node-status-exporter-g26qq, nvidia-gpu-operator/nvidia-operator-validator-4bdzg, openshift-cluster-node-tuning-operator/tuned-qtk2z, openshift-dns/dns-default-4rpvn, openshift-dns/node-resolver-rrxvb, openshift-image-registry/node-ca-4n5jl, openshift-ingress-canary/ingress-canary-kqgxv, openshift-logging/collector-wzhxl, openshift-machine-config-operator/machine-config-daemon-clm97, openshift-monitoring/node-exporter-zn65c, openshift-multus/multus-additional-cni-plugins-x4wjf, openshift-multus/multus-npzsd, openshift-multus/network-metrics-daemon-mc46n, openshift-network-diagnostics/network-check-target-z2dhn, openshift-nfd/nfd-worker-smjbl, openshift-nmstate/nmstate-handler-pf6qk, openshift-sdn/sdn-fh6cf, openshift-storage/csi-rbdplugin-9lfjr, seccomp-profile-installer/seccomp-profile-installer-zn4wr], continuing command...
There are pending nodes to be drained:
 wrk-101
cannot delete Pods with local storage (use --delete-emptydir-data to override): curator-system/curatordb-cluster-00-lgb6-0, openshift-storage/object-backing-store-noobaa-pod-552cdf65, sail-24887a/comets-mongo-6c79b94448-fcg5h
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-27n8k, nvidia-gpu-operator/gpu-feature-discovery-wktn8, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-p5b5w, nvidia-gpu-operator/nvidia-dcgm-exporter-dfhg4, nvidia-gpu-operator/nvidia-dcgm-tz7cm, nvidia-gpu-operator/nvidia-device-plugin-daemonset-mg45p, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-cgbwj, nvidia-gpu-operator/nvidia-mig-manager-8pcxq, nvidia-gpu-operator/nvidia-node-status-exporter-g26qq, nvidia-gpu-operator/nvidia-operator-validator-4bdzg, openshift-cluster-node-tuning-operator/tuned-qtk2z, openshift-dns/dns-default-4rpvn, openshift-dns/node-resolver-rrxvb, openshift-image-registry/node-ca-4n5jl, openshift-ingress-canary/ingress-canary-kqgxv, openshift-logging/collector-wzhxl, openshift-machine-config-operator/machine-config-daemon-clm97, openshift-monitoring/node-exporter-zn65c, openshift-multus/multus-additional-cni-plugins-x4wjf, openshift-multus/multus-npzsd, openshift-multus/network-metrics-daemon-mc46n, openshift-network-diagnostics/network-check-target-z2dhn, openshift-nfd/nfd-worker-smjbl, openshift-nmstate/nmstate-handler-pf6qk, openshift-sdn/sdn-fh6cf, openshift-storage/csi-rbdplugin-9lfjr, seccomp-profile-installer/seccomp-profile-installer-zn4wr
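
(When these nodes are eventually drained during a maintenance, the dry-run output above already names the required flags. A sketch, assuming DaemonSet pods can be left to the operators and emptyDir data on the remaining user pods can be discarded:)

$ for node in wrk-94 wrk-95 wrk-96 wrk-101; do
    # evicts user pods; DaemonSet pods stay put, emptyDir contents are lost
    oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data
  done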