joachimweyl opened 1 month ago
@jtriley How much effort is this to kick off? Do you have a timeframe planned for it?
@joachimweyl looking into this now. Currently scanning the nodes for GPU workloads using:
$ oc get nodes -o name -l 'nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB' | xargs -I {} -t oc debug --as=system:admin {} -- chroot /host bash -c 'crictl exec $(crictl ps --name nvidia-driver-ctr -q) nvidia-smi --query-compute-apps=pid --format csv,noheader'
How did this go?
@joachimweyl I've cordoned the following 4x A100-SXM4 nodes in the prod cluster:
$ oc get nodes -l 'nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB' | grep -i schedulingdisabled
wrk-101 Ready,SchedulingDisabled worker 173d v1.28.11+add48d0
wrk-94 Ready,SchedulingDisabled worker 189d v1.28.11+add48d0
wrk-95 Ready,SchedulingDisabled worker 189d v1.28.11+add48d0
wrk-96 Ready,SchedulingDisabled worker 189d v1.28.11+add48d0
All of these nodes appear to have some user workloads active; however, I've confirmed that wrk-101 and wrk-96 have no rhods-notebooks pods running. Those two also have no active GPU workloads according to nvidia-smi.
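One way to re-check which of the cordoned nodes still carry notebook pods is to filter the pod-to-node mapping from `oc get pods -n rhods-notebooks -o wide`. A minimal sketch, with a hypothetical helper (`node_has_notebooks`) and illustrative stand-in data rather than live cluster output (the node column position in `-o wide` output can vary, so the pairs are pre-extracted):

```shell
# Hedged sketch: given "pod node" pairs (pre-extracted from
# `oc get pods -n rhods-notebooks -o wide --no-headers`), report
# whether a node still hosts notebook pods.
node_has_notebooks() {  # $1 = node name; stdin = "pod node" pairs
  awk -v n="$1" '$2 == n { found = 1 } END { exit !found }'
}

# Illustrative stand-in data only, not live cluster output.
sample='jupyter-nb-ashikap-40bu-2eedu-0 wrk-94
jupyter-nb-ekim6535-40bu-2eedu-0 wrk-95'

for node in wrk-94 wrk-95 wrk-96 wrk-101; do
  if printf '%s\n' "$sample" | node_has_notebooks "$node"; then
    echo "$node: rhods-notebooks pods still present"
  else
    echo "$node: clear of rhods-notebooks pods"
  fi
done
```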
To remove these from the prod cluster for ESI we'll need to do one of the following:
The pods currently running on those 4x hosts:
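The listing below reads like the output of a server-side drain dry run; a hedged reconstruction of the invocation (assumed, since the exact command isn't shown in the thread):

```shell
# Assumed invocation: server-side dry run of the drain on each cordoned node.
# Without --ignore-daemonsets / --delete-emptydir-data the drain is refused,
# which is what surfaces the pod lists below.
for node in wrk-94 wrk-95 wrk-96 wrk-101; do
  oc adm drain "$node" --dry-run=server
done
```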
node/wrk-94 cordoned (server dry run)
error: unable to drain node "wrk-94" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-f8qn4, nvidia-gpu-operator/gpu-feature-discovery-wf4zs, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-58c9q, nvidia-gpu-operator/nvidia-dcgm-exporter-ptdxq, nvidia-gpu-operator/nvidia-dcgm-k99h5, nvidia-gpu-operator/nvidia-device-plugin-daemonset-gp6zx, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-4wg7l, nvidia-gpu-operator/nvidia-mig-manager-flr2w, nvidia-gpu-operator/nvidia-node-status-exporter-hfhvx, nvidia-gpu-operator/nvidia-operator-validator-4xzv5, openshift-cluster-node-tuning-operator/tuned-tkw8j, openshift-dns/dns-default-z7rq4, openshift-dns/node-resolver-gvxzq, openshift-image-registry/node-ca-zznc6, openshift-ingress-canary/ingress-canary-tspxc, openshift-logging/collector-hhlxt, openshift-machine-config-operator/machine-config-daemon-kfw4q, openshift-monitoring/node-exporter-h2dpm, openshift-multus/multus-68c4p, openshift-multus/multus-additional-cni-plugins-d942k, openshift-multus/network-metrics-daemon-6hkzk, openshift-network-diagnostics/network-check-target-x7lcj, openshift-nfd/nfd-worker-s57pg, openshift-nmstate/nmstate-handler-q7vvf, openshift-sdn/sdn-t7g7x, openshift-storage/csi-rbdplugin-xp6wm, seccomp-profile-installer/seccomp-profile-installer-m64kq, cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-user-workload-monitoring/thanos-ruler-user-workload-1, rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0, rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0, rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0, rhods-notebooks/jupyter-nb-hjgong-40bu-2eedu-0, rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0, rhods-notebooks/jupyter-nb-kimc-40bu-2eedu-0, rhods-notebooks/jupyter-nb-kthanasi-40bu-2eedu-0, rhods-notebooks/jupyter-nb-lferris1-40bu-2eedu-0, rhods-notebooks/jupyter-nb-lgp116-40bu-2eedu-0, 
rhods-notebooks/jupyter-nb-linhb-40bu-2eedu-0, rhods-notebooks/jupyter-nb-mikelel-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tsaij-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zhaoxm-40bu-2eedu-0, rustproject-3304e3/working-section-0], continuing command...
There are pending nodes to be drained:
wrk-94
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-f8qn4, nvidia-gpu-operator/gpu-feature-discovery-wf4zs, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-58c9q, nvidia-gpu-operator/nvidia-dcgm-exporter-ptdxq, nvidia-gpu-operator/nvidia-dcgm-k99h5, nvidia-gpu-operator/nvidia-device-plugin-daemonset-gp6zx, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-4wg7l, nvidia-gpu-operator/nvidia-mig-manager-flr2w, nvidia-gpu-operator/nvidia-node-status-exporter-hfhvx, nvidia-gpu-operator/nvidia-operator-validator-4xzv5, openshift-cluster-node-tuning-operator/tuned-tkw8j, openshift-dns/dns-default-z7rq4, openshift-dns/node-resolver-gvxzq, openshift-image-registry/node-ca-zznc6, openshift-ingress-canary/ingress-canary-tspxc, openshift-logging/collector-hhlxt, openshift-machine-config-operator/machine-config-daemon-kfw4q, openshift-monitoring/node-exporter-h2dpm, openshift-multus/multus-68c4p, openshift-multus/multus-additional-cni-plugins-d942k, openshift-multus/network-metrics-daemon-6hkzk, openshift-network-diagnostics/network-check-target-x7lcj, openshift-nfd/nfd-worker-s57pg, openshift-nmstate/nmstate-handler-q7vvf, openshift-sdn/sdn-t7g7x, openshift-storage/csi-rbdplugin-xp6wm, seccomp-profile-installer/seccomp-profile-installer-m64kq
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-user-workload-monitoring/thanos-ruler-user-workload-1, rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0, rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0, rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0, rhods-notebooks/jupyter-nb-hjgong-40bu-2eedu-0, rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0, rhods-notebooks/jupyter-nb-kimc-40bu-2eedu-0, rhods-notebooks/jupyter-nb-kthanasi-40bu-2eedu-0, rhods-notebooks/jupyter-nb-lferris1-40bu-2eedu-0, rhods-notebooks/jupyter-nb-lgp116-40bu-2eedu-0, rhods-notebooks/jupyter-nb-linhb-40bu-2eedu-0, rhods-notebooks/jupyter-nb-mikelel-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tsaij-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zhaoxm-40bu-2eedu-0, rustproject-3304e3/working-section-0
node/wrk-95 cordoned (server dry run)
error: unable to drain node "wrk-95" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-rc7wz, nvidia-gpu-operator/gpu-feature-discovery-72bpf, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lgtv8, nvidia-gpu-operator/nvidia-dcgm-824mf, nvidia-gpu-operator/nvidia-dcgm-exporter-d8xv9, nvidia-gpu-operator/nvidia-device-plugin-daemonset-k9vj5, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-m6dzg, nvidia-gpu-operator/nvidia-mig-manager-qz9lf, nvidia-gpu-operator/nvidia-node-status-exporter-nvrvj, nvidia-gpu-operator/nvidia-operator-validator-jqrwq, openshift-cluster-node-tuning-operator/tuned-rvjkb, openshift-dns/dns-default-s742x, openshift-dns/node-resolver-b4zb4, openshift-image-registry/node-ca-v76q5, openshift-ingress-canary/ingress-canary-k92pz, openshift-logging/collector-xrvxj, openshift-machine-config-operator/machine-config-daemon-d26xv, openshift-monitoring/node-exporter-7htxs, openshift-multus/multus-additional-cni-plugins-vsg9t, openshift-multus/multus-bwtlz, openshift-multus/network-metrics-daemon-fhvn4, openshift-network-diagnostics/network-check-target-26thj, openshift-nfd/nfd-worker-m2n5l, openshift-nmstate/nmstate-handler-dh4k4, openshift-sdn/sdn-mfhm8, openshift-storage/csi-rbdplugin-f74x9, seccomp-profile-installer/seccomp-profile-installer-sjzm9, cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-monitoring/prometheus-k8s-0, openshift-user-workload-monitoring/prometheus-user-workload-0, openshift-user-workload-monitoring/thanos-ruler-user-workload-0, rhods-notebooks/jupyter-nb-aalaman-40bu-2eedu-0, rhods-notebooks/jupyter-nb-aryasur-40bu-2eedu-0, rhods-notebooks/jupyter-nb-celiag-40bu-2eedu-0, rhods-notebooks/jupyter-nb-ekim6535-40bu-2eedu-0, rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0, rhods-notebooks/jupyter-nb-mmeng-40bu-2eedu-0, rhods-notebooks/jupyter-nb-nickt03-40bu-2eedu-0, 
rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0, rhods-notebooks/jupyter-nb-sliu10-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tessat-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0, rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0, sail-24887a/redis-558f589fbb-c77ws], continuing command...
There are pending nodes to be drained:
wrk-95
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-rc7wz, nvidia-gpu-operator/gpu-feature-discovery-72bpf, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lgtv8, nvidia-gpu-operator/nvidia-dcgm-824mf, nvidia-gpu-operator/nvidia-dcgm-exporter-d8xv9, nvidia-gpu-operator/nvidia-device-plugin-daemonset-k9vj5, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-m6dzg, nvidia-gpu-operator/nvidia-mig-manager-qz9lf, nvidia-gpu-operator/nvidia-node-status-exporter-nvrvj, nvidia-gpu-operator/nvidia-operator-validator-jqrwq, openshift-cluster-node-tuning-operator/tuned-rvjkb, openshift-dns/dns-default-s742x, openshift-dns/node-resolver-b4zb4, openshift-image-registry/node-ca-v76q5, openshift-ingress-canary/ingress-canary-k92pz, openshift-logging/collector-xrvxj, openshift-machine-config-operator/machine-config-daemon-d26xv, openshift-monitoring/node-exporter-7htxs, openshift-multus/multus-additional-cni-plugins-vsg9t, openshift-multus/multus-bwtlz, openshift-multus/network-metrics-daemon-fhvn4, openshift-network-diagnostics/network-check-target-26thj, openshift-nfd/nfd-worker-m2n5l, openshift-nmstate/nmstate-handler-dh4k4, openshift-sdn/sdn-mfhm8, openshift-storage/csi-rbdplugin-f74x9, seccomp-profile-installer/seccomp-profile-installer-sjzm9
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-monitoring/prometheus-k8s-0, openshift-user-workload-monitoring/prometheus-user-workload-0, openshift-user-workload-monitoring/thanos-ruler-user-workload-0, rhods-notebooks/jupyter-nb-aalaman-40bu-2eedu-0, rhods-notebooks/jupyter-nb-aryasur-40bu-2eedu-0, rhods-notebooks/jupyter-nb-celiag-40bu-2eedu-0, rhods-notebooks/jupyter-nb-ekim6535-40bu-2eedu-0, rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0, rhods-notebooks/jupyter-nb-mmeng-40bu-2eedu-0, rhods-notebooks/jupyter-nb-nickt03-40bu-2eedu-0, rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0, rhods-notebooks/jupyter-nb-sliu10-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tessat-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0, rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0, sail-24887a/redis-558f589fbb-c77ws
node/wrk-96 already cordoned (server dry run)
error: unable to drain node "wrk-96" due to error:[cannot delete Pods with local storage (use --delete-emptydir-data to override): ai-telemetry-cbca60/keycloak-0, koku-metrics-operator/curatordb-cluster-repo-host-0, ope-rhods-testing-1fef2f/danni-test-2-0, ope-rhods-testing-1fef2f/ja-ucsls-0, ope-rhods-testing-1fef2f/meera-utc-0, ope-rhods-testing-1fef2f/test-image-0, ope-rhods-testing-1fef2f/vnc-0, openshift-monitoring/alertmanager-main-0, cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-kvszc, nvidia-gpu-operator/gpu-feature-discovery-l5mvr, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lb6gb, nvidia-gpu-operator/nvidia-dcgm-exporter-jwqdd, nvidia-gpu-operator/nvidia-dcgm-ljf2v, nvidia-gpu-operator/nvidia-device-plugin-daemonset-nzfvq, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-jqrsg, nvidia-gpu-operator/nvidia-mig-manager-wfxlc, nvidia-gpu-operator/nvidia-node-status-exporter-x874h, nvidia-gpu-operator/nvidia-operator-validator-vg9p4, openshift-cluster-node-tuning-operator/tuned-4whll, openshift-dns/dns-default-fj764, openshift-dns/node-resolver-wnhwt, openshift-image-registry/node-ca-wnf8r, openshift-ingress-canary/ingress-canary-wg7k2, openshift-logging/collector-qsdgg, openshift-machine-config-operator/machine-config-daemon-klc7h, openshift-monitoring/node-exporter-wcrnf, openshift-multus/multus-6b9m4, openshift-multus/multus-additional-cni-plugins-wqd74, openshift-multus/network-metrics-daemon-47xxj, openshift-network-diagnostics/network-check-target-bknh8, openshift-nfd/nfd-worker-d6kr4, openshift-nmstate/nmstate-handler-fjclc, openshift-sdn/sdn-67j55, openshift-storage/csi-rbdplugin-fwp2g, seccomp-profile-installer/seccomp-profile-installer-d4qhj], continuing command...
There are pending nodes to be drained:
wrk-96
cannot delete Pods with local storage (use --delete-emptydir-data to override): ai-telemetry-cbca60/keycloak-0, koku-metrics-operator/curatordb-cluster-repo-host-0, ope-rhods-testing-1fef2f/danni-test-2-0, ope-rhods-testing-1fef2f/ja-ucsls-0, ope-rhods-testing-1fef2f/meera-utc-0, ope-rhods-testing-1fef2f/test-image-0, ope-rhods-testing-1fef2f/vnc-0, openshift-monitoring/alertmanager-main-0
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-kvszc, nvidia-gpu-operator/gpu-feature-discovery-l5mvr, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lb6gb, nvidia-gpu-operator/nvidia-dcgm-exporter-jwqdd, nvidia-gpu-operator/nvidia-dcgm-ljf2v, nvidia-gpu-operator/nvidia-device-plugin-daemonset-nzfvq, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-jqrsg, nvidia-gpu-operator/nvidia-mig-manager-wfxlc, nvidia-gpu-operator/nvidia-node-status-exporter-x874h, nvidia-gpu-operator/nvidia-operator-validator-vg9p4, openshift-cluster-node-tuning-operator/tuned-4whll, openshift-dns/dns-default-fj764, openshift-dns/node-resolver-wnhwt, openshift-image-registry/node-ca-wnf8r, openshift-ingress-canary/ingress-canary-wg7k2, openshift-logging/collector-qsdgg, openshift-machine-config-operator/machine-config-daemon-klc7h, openshift-monitoring/node-exporter-wcrnf, openshift-multus/multus-6b9m4, openshift-multus/multus-additional-cni-plugins-wqd74, openshift-multus/network-metrics-daemon-47xxj, openshift-network-diagnostics/network-check-target-bknh8, openshift-nfd/nfd-worker-d6kr4, openshift-nmstate/nmstate-handler-fjclc, openshift-sdn/sdn-67j55, openshift-storage/csi-rbdplugin-fwp2g, seccomp-profile-installer/seccomp-profile-installer-d4qhj
node/wrk-101 already cordoned (server dry run)
error: unable to drain node "wrk-101" due to error:[cannot delete Pods with local storage (use --delete-emptydir-data to override): curator-system/curatordb-cluster-00-lgb6-0, openshift-storage/object-backing-store-noobaa-pod-552cdf65, sail-24887a/comets-mongo-6c79b94448-fcg5h, cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-27n8k, nvidia-gpu-operator/gpu-feature-discovery-wktn8, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-p5b5w, nvidia-gpu-operator/nvidia-dcgm-exporter-dfhg4, nvidia-gpu-operator/nvidia-dcgm-tz7cm, nvidia-gpu-operator/nvidia-device-plugin-daemonset-mg45p, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-cgbwj, nvidia-gpu-operator/nvidia-mig-manager-8pcxq, nvidia-gpu-operator/nvidia-node-status-exporter-g26qq, nvidia-gpu-operator/nvidia-operator-validator-4bdzg, openshift-cluster-node-tuning-operator/tuned-qtk2z, openshift-dns/dns-default-4rpvn, openshift-dns/node-resolver-rrxvb, openshift-image-registry/node-ca-4n5jl, openshift-ingress-canary/ingress-canary-kqgxv, openshift-logging/collector-wzhxl, openshift-machine-config-operator/machine-config-daemon-clm97, openshift-monitoring/node-exporter-zn65c, openshift-multus/multus-additional-cni-plugins-x4wjf, openshift-multus/multus-npzsd, openshift-multus/network-metrics-daemon-mc46n, openshift-network-diagnostics/network-check-target-z2dhn, openshift-nfd/nfd-worker-smjbl, openshift-nmstate/nmstate-handler-zrfp9, openshift-sdn/sdn-fh6cf, openshift-storage/csi-rbdplugin-9lfjr, seccomp-profile-installer/seccomp-profile-installer-zn4wr], continuing command...
There are pending nodes to be drained:
wrk-101
cannot delete Pods with local storage (use --delete-emptydir-data to override): curator-system/curatordb-cluster-00-lgb6-0, openshift-storage/object-backing-store-noobaa-pod-552cdf65, sail-24887a/comets-mongo-6c79b94448-fcg5h
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-27n8k, nvidia-gpu-operator/gpu-feature-discovery-wktn8, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-p5b5w, nvidia-gpu-operator/nvidia-dcgm-exporter-dfhg4, nvidia-gpu-operator/nvidia-dcgm-tz7cm, nvidia-gpu-operator/nvidia-device-plugin-daemonset-mg45p, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-cgbwj, nvidia-gpu-operator/nvidia-mig-manager-8pcxq, nvidia-gpu-operator/nvidia-node-status-exporter-g26qq, nvidia-gpu-operator/nvidia-operator-validator-4bdzg, openshift-cluster-node-tuning-operator/tuned-qtk2z, openshift-dns/dns-default-4rpvn, openshift-dns/node-resolver-rrxvb, openshift-image-registry/node-ca-4n5jl, openshift-ingress-canary/ingress-canary-kqgxv, openshift-logging/collector-wzhxl, openshift-machine-config-operator/machine-config-daemon-clm97, openshift-monitoring/node-exporter-zn65c, openshift-multus/multus-additional-cni-plugins-x4wjf, openshift-multus/multus-npzsd, openshift-multus/network-metrics-daemon-mc46n, openshift-network-diagnostics/network-check-target-z2dhn, openshift-nfd/nfd-worker-smjbl, openshift-nmstate/nmstate-handler-zrfp9, openshift-sdn/sdn-fh6cf, openshift-storage/csi-rbdplugin-9lfjr, seccomp-profile-installer/seccomp-profile-installer-zn4wr
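If we decide to evict these workloads before users wind down on their own, the drain has to override both refusal classes explicitly. A hedged sketch; note that `--delete-emptydir-data` discards any notebook state kept on emptyDir volumes, so this is destructive for those users:

```shell
# Destructive: evicts user pods and discards emptyDir-backed local storage.
# DaemonSet pods (GPU operator, SDN, monitoring, etc.) are left in place.
oc adm drain wrk-94 --ignore-daemonsets --delete-emptydir-data
```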
For now, RH has the hardware they need, so we can wait to see if the pods stop by themselves soon. If the nodes have not emptied by our next maintenance window, let's do it then. If we need to provide GPUs to RH sooner than the next maintenance, @Milstein can you create (or copy-paste, if you already have one) a short communication letting these users know their projects will be restarted on new nodes?
@jtriley have any of the nodes cleared of pods?
Still looking at the list of pods, but it looks like 2/4 have cleared; the other 2, at a minimum, still have rhods-notebooks pods running:
node/wrk-94 already cordoned
error: unable to drain node "wrk-94" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-f8qn4, nvidia-gpu-operator/gpu-feature-discovery-wf4zs, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-58c9q, nvidia-gpu-operator/nvidia-dcgm-exporter-ptdxq, nvidia-gpu-operator/nvidia-dcgm-k99h5, nvidia-gpu-operator/nvidia-device-plugin-daemonset-gp6zx, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-4wg7l, nvidia-gpu-operator/nvidia-mig-manager-flr2w, nvidia-gpu-operator/nvidia-node-status-exporter-hfhvx, nvidia-gpu-operator/nvidia-operator-validator-4xzv5, openshift-cluster-node-tuning-operator/tuned-tkw8j, openshift-dns/dns-default-z7rq4, openshift-dns/node-resolver-gvxzq, openshift-image-registry/node-ca-zznc6, openshift-ingress-canary/ingress-canary-tspxc, openshift-logging/collector-hhlxt, openshift-machine-config-operator/machine-config-daemon-kfw4q, openshift-monitoring/node-exporter-h2dpm, openshift-multus/multus-68c4p, openshift-multus/multus-additional-cni-plugins-d942k, openshift-multus/network-metrics-daemon-6hkzk, openshift-network-diagnostics/network-check-target-x7lcj, openshift-nfd/nfd-worker-s57pg, openshift-nmstate/nmstate-handler-zmnp5, openshift-sdn/sdn-t7g7x, openshift-storage/csi-rbdplugin-xp6wm, seccomp-profile-installer/seccomp-profile-installer-m64kq, cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-user-workload-monitoring/thanos-ruler-user-workload-1, rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0, rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0, rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0, rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0], continuing command...
There are pending nodes to be drained:
wrk-94
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-f8qn4, nvidia-gpu-operator/gpu-feature-discovery-wf4zs, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-58c9q, nvidia-gpu-operator/nvidia-dcgm-exporter-ptdxq, nvidia-gpu-operator/nvidia-dcgm-k99h5, nvidia-gpu-operator/nvidia-device-plugin-daemonset-gp6zx, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-4wg7l, nvidia-gpu-operator/nvidia-mig-manager-flr2w, nvidia-gpu-operator/nvidia-node-status-exporter-hfhvx, nvidia-gpu-operator/nvidia-operator-validator-4xzv5, openshift-cluster-node-tuning-operator/tuned-tkw8j, openshift-dns/dns-default-z7rq4, openshift-dns/node-resolver-gvxzq, openshift-image-registry/node-ca-zznc6, openshift-ingress-canary/ingress-canary-tspxc, openshift-logging/collector-hhlxt, openshift-machine-config-operator/machine-config-daemon-kfw4q, openshift-monitoring/node-exporter-h2dpm, openshift-multus/multus-68c4p, openshift-multus/multus-additional-cni-plugins-d942k, openshift-multus/network-metrics-daemon-6hkzk, openshift-network-diagnostics/network-check-target-x7lcj, openshift-nfd/nfd-worker-s57pg, openshift-nmstate/nmstate-handler-zmnp5, openshift-sdn/sdn-t7g7x, openshift-storage/csi-rbdplugin-xp6wm, seccomp-profile-installer/seccomp-profile-installer-m64kq
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-user-workload-monitoring/thanos-ruler-user-workload-1, rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0, rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0, rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0, rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0
node/wrk-95 already cordoned
error: unable to drain node "wrk-95" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-rc7wz, nvidia-gpu-operator/gpu-feature-discovery-72bpf, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lgtv8, nvidia-gpu-operator/nvidia-dcgm-824mf, nvidia-gpu-operator/nvidia-dcgm-exporter-d8xv9, nvidia-gpu-operator/nvidia-device-plugin-daemonset-k9vj5, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-m6dzg, nvidia-gpu-operator/nvidia-mig-manager-qz9lf, nvidia-gpu-operator/nvidia-node-status-exporter-nvrvj, nvidia-gpu-operator/nvidia-operator-validator-jqrwq, openshift-cluster-node-tuning-operator/tuned-rvjkb, openshift-dns/dns-default-s742x, openshift-dns/node-resolver-b4zb4, openshift-image-registry/node-ca-v76q5, openshift-ingress-canary/ingress-canary-k92pz, openshift-logging/collector-xrvxj, openshift-machine-config-operator/machine-config-daemon-d26xv, openshift-monitoring/node-exporter-7htxs, openshift-multus/multus-additional-cni-plugins-vsg9t, openshift-multus/multus-bwtlz, openshift-multus/network-metrics-daemon-fhvn4, openshift-network-diagnostics/network-check-target-26thj, openshift-nfd/nfd-worker-m2n5l, openshift-nmstate/nmstate-handler-kxdtq, openshift-sdn/sdn-mfhm8, openshift-storage/csi-rbdplugin-f74x9, seccomp-profile-installer/seccomp-profile-installer-sjzm9, cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-monitoring/prometheus-k8s-0, openshift-user-workload-monitoring/prometheus-user-workload-0, openshift-user-workload-monitoring/thanos-ruler-user-workload-0, rhods-notebooks/jupyter-nb-ekim6535-40bu-2eedu-0, rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0, rhods-notebooks/jupyter-nb-nickt03-40bu-2eedu-0, rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0, rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0, 
sail-24887a/redis-558f589fbb-c77ws], continuing command...
There are pending nodes to be drained:
wrk-95
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-rc7wz, nvidia-gpu-operator/gpu-feature-discovery-72bpf, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lgtv8, nvidia-gpu-operator/nvidia-dcgm-824mf, nvidia-gpu-operator/nvidia-dcgm-exporter-d8xv9, nvidia-gpu-operator/nvidia-device-plugin-daemonset-k9vj5, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-m6dzg, nvidia-gpu-operator/nvidia-mig-manager-qz9lf, nvidia-gpu-operator/nvidia-node-status-exporter-nvrvj, nvidia-gpu-operator/nvidia-operator-validator-jqrwq, openshift-cluster-node-tuning-operator/tuned-rvjkb, openshift-dns/dns-default-s742x, openshift-dns/node-resolver-b4zb4, openshift-image-registry/node-ca-v76q5, openshift-ingress-canary/ingress-canary-k92pz, openshift-logging/collector-xrvxj, openshift-machine-config-operator/machine-config-daemon-d26xv, openshift-monitoring/node-exporter-7htxs, openshift-multus/multus-additional-cni-plugins-vsg9t, openshift-multus/multus-bwtlz, openshift-multus/network-metrics-daemon-fhvn4, openshift-network-diagnostics/network-check-target-26thj, openshift-nfd/nfd-worker-m2n5l, openshift-nmstate/nmstate-handler-kxdtq, openshift-sdn/sdn-mfhm8, openshift-storage/csi-rbdplugin-f74x9, seccomp-profile-installer/seccomp-profile-installer-sjzm9
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-monitoring/prometheus-k8s-0, openshift-user-workload-monitoring/prometheus-user-workload-0, openshift-user-workload-monitoring/thanos-ruler-user-workload-0, rhods-notebooks/jupyter-nb-ekim6535-40bu-2eedu-0, rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0, rhods-notebooks/jupyter-nb-nickt03-40bu-2eedu-0, rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0, rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0, sail-24887a/redis-558f589fbb-c77ws
node/wrk-96 already cordoned
error: unable to drain node "wrk-96" due to error:[cannot delete Pods with local storage (use --delete-emptydir-data to override): ai-telemetry-cbca60/keycloak-0, koku-metrics-operator/curatordb-cluster-repo-host-0, ope-rhods-testing-1fef2f/ja-ucsls-0, ope-rhods-testing-1fef2f/meera-utc-0, ope-rhods-testing-1fef2f/test-image-0, ope-rhods-testing-1fef2f/vnc-0, openshift-monitoring/alertmanager-main-0, cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-kvszc, nvidia-gpu-operator/gpu-feature-discovery-l5mvr, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lb6gb, nvidia-gpu-operator/nvidia-dcgm-exporter-jwqdd, nvidia-gpu-operator/nvidia-dcgm-ljf2v, nvidia-gpu-operator/nvidia-device-plugin-daemonset-nzfvq, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-jqrsg, nvidia-gpu-operator/nvidia-mig-manager-wfxlc, nvidia-gpu-operator/nvidia-node-status-exporter-x874h, nvidia-gpu-operator/nvidia-operator-validator-vg9p4, openshift-cluster-node-tuning-operator/tuned-4whll, openshift-dns/dns-default-fj764, openshift-dns/node-resolver-wnhwt, openshift-image-registry/node-ca-wnf8r, openshift-ingress-canary/ingress-canary-wg7k2, openshift-logging/collector-qsdgg, openshift-machine-config-operator/machine-config-daemon-klc7h, openshift-monitoring/node-exporter-wcrnf, openshift-multus/multus-6b9m4, openshift-multus/multus-additional-cni-plugins-wqd74, openshift-multus/network-metrics-daemon-47xxj, openshift-network-diagnostics/network-check-target-bknh8, openshift-nfd/nfd-worker-d6kr4, openshift-nmstate/nmstate-handler-x6mwm, openshift-sdn/sdn-67j55, openshift-storage/csi-rbdplugin-fwp2g, seccomp-profile-installer/seccomp-profile-installer-d4qhj], continuing command...
There are pending nodes to be drained:
wrk-96
cannot delete Pods with local storage (use --delete-emptydir-data to override): ai-telemetry-cbca60/keycloak-0, koku-metrics-operator/curatordb-cluster-repo-host-0, ope-rhods-testing-1fef2f/ja-ucsls-0, ope-rhods-testing-1fef2f/meera-utc-0, ope-rhods-testing-1fef2f/test-image-0, ope-rhods-testing-1fef2f/vnc-0, openshift-monitoring/alertmanager-main-0
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-kvszc, nvidia-gpu-operator/gpu-feature-discovery-l5mvr, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lb6gb, nvidia-gpu-operator/nvidia-dcgm-exporter-jwqdd, nvidia-gpu-operator/nvidia-dcgm-ljf2v, nvidia-gpu-operator/nvidia-device-plugin-daemonset-nzfvq, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-jqrsg, nvidia-gpu-operator/nvidia-mig-manager-wfxlc, nvidia-gpu-operator/nvidia-node-status-exporter-x874h, nvidia-gpu-operator/nvidia-operator-validator-vg9p4, openshift-cluster-node-tuning-operator/tuned-4whll, openshift-dns/dns-default-fj764, openshift-dns/node-resolver-wnhwt, openshift-image-registry/node-ca-wnf8r, openshift-ingress-canary/ingress-canary-wg7k2, openshift-logging/collector-qsdgg, openshift-machine-config-operator/machine-config-daemon-klc7h, openshift-monitoring/node-exporter-wcrnf, openshift-multus/multus-6b9m4, openshift-multus/multus-additional-cni-plugins-wqd74, openshift-multus/network-metrics-daemon-47xxj, openshift-network-diagnostics/network-check-target-bknh8, openshift-nfd/nfd-worker-d6kr4, openshift-nmstate/nmstate-handler-x6mwm, openshift-sdn/sdn-67j55, openshift-storage/csi-rbdplugin-fwp2g, seccomp-profile-installer/seccomp-profile-installer-d4qhj
node/wrk-101 already cordoned
error: unable to drain node "wrk-101" due to error:[cannot delete Pods with local storage (use --delete-emptydir-data to override): curator-system/curatordb-cluster-00-lgb6-0, openshift-storage/object-backing-store-noobaa-pod-552cdf65, sail-24887a/comets-mongo-6c79b94448-fcg5h, cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-27n8k, nvidia-gpu-operator/gpu-feature-discovery-wktn8, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-p5b5w, nvidia-gpu-operator/nvidia-dcgm-exporter-dfhg4, nvidia-gpu-operator/nvidia-dcgm-tz7cm, nvidia-gpu-operator/nvidia-device-plugin-daemonset-mg45p, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-cgbwj, nvidia-gpu-operator/nvidia-mig-manager-8pcxq, nvidia-gpu-operator/nvidia-node-status-exporter-g26qq, nvidia-gpu-operator/nvidia-operator-validator-4bdzg, openshift-cluster-node-tuning-operator/tuned-qtk2z, openshift-dns/dns-default-4rpvn, openshift-dns/node-resolver-rrxvb, openshift-image-registry/node-ca-4n5jl, openshift-ingress-canary/ingress-canary-kqgxv, openshift-logging/collector-wzhxl, openshift-machine-config-operator/machine-config-daemon-clm97, openshift-monitoring/node-exporter-zn65c, openshift-multus/multus-additional-cni-plugins-x4wjf, openshift-multus/multus-npzsd, openshift-multus/network-metrics-daemon-mc46n, openshift-network-diagnostics/network-check-target-z2dhn, openshift-nfd/nfd-worker-smjbl, openshift-nmstate/nmstate-handler-pf6qk, openshift-sdn/sdn-fh6cf, openshift-storage/csi-rbdplugin-9lfjr, seccomp-profile-installer/seccomp-profile-installer-zn4wr], continuing command...
There are pending nodes to be drained:
wrk-101
cannot delete Pods with local storage (use --delete-emptydir-data to override): curator-system/curatordb-cluster-00-lgb6-0, openshift-storage/object-backing-store-noobaa-pod-552cdf65, sail-24887a/comets-mongo-6c79b94448-fcg5h
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-27n8k, nvidia-gpu-operator/gpu-feature-discovery-wktn8, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-p5b5w, nvidia-gpu-operator/nvidia-dcgm-exporter-dfhg4, nvidia-gpu-operator/nvidia-dcgm-tz7cm, nvidia-gpu-operator/nvidia-device-plugin-daemonset-mg45p, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-cgbwj, nvidia-gpu-operator/nvidia-mig-manager-8pcxq, nvidia-gpu-operator/nvidia-node-status-exporter-g26qq, nvidia-gpu-operator/nvidia-operator-validator-4bdzg, openshift-cluster-node-tuning-operator/tuned-qtk2z, openshift-dns/dns-default-4rpvn, openshift-dns/node-resolver-rrxvb, openshift-image-registry/node-ca-4n5jl, openshift-ingress-canary/ingress-canary-kqgxv, openshift-logging/collector-wzhxl, openshift-machine-config-operator/machine-config-daemon-clm97, openshift-monitoring/node-exporter-zn65c, openshift-multus/multus-additional-cni-plugins-x4wjf, openshift-multus/multus-npzsd, openshift-multus/network-metrics-daemon-mc46n, openshift-network-diagnostics/network-check-target-z2dhn, openshift-nfd/nfd-worker-smjbl, openshift-nmstate/nmstate-handler-pf6qk, openshift-sdn/sdn-fh6cf, openshift-storage/csi-rbdplugin-9lfjr, seccomp-profile-installer/seccomp-profile-installer-zn4wr
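To keep tracking which notebooks are still holding up wrk-94 and wrk-95, a direct per-node query avoids re-running the drain dry run. A hedged sketch using a field selector on the pod's scheduled node:

```shell
# List any rhods-notebooks pods still scheduled on each remaining node.
for node in wrk-94 wrk-95; do
  echo "== $node =="
  oc get pods -n rhods-notebooks -o wide \
    --field-selector "spec.nodeName=$node"
done
```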
Motivation
We need more and more BM GPU nodes for RH, and we are currently well below capacity for OpenShift usage. If we cordon nodes, they will be easy to move to either BM or OpenShift. Cordoning them forces usage onto the other 4 nodes, leaving us with 4 nodes that are more flexible.
Completion Criteria
6 nodes cordoned and ready either to be moved to BM or to be uncordoned for OpenShift
Description
Completion dates
Desired - 2024-09-05
Required - TBD