Closed by joachimweyl 1 week ago
@jtriley How much effort is this to kick off? Do you have a timeframe planned for it?
@joachimweyl looking into this now. Currently scanning the nodes for GPU workloads using:
$ oc get nodes -o name -l 'nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB' | xargs -I {} -t oc debug --as=system:admin {} -- chroot /host bash -c 'crictl exec $(crictl ps --name nvidia-driver-ctr -q) nvidia-smi --query-compute-apps=pid --format csv,noheader'
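As a cross-check (a helper command of my own, not something already run here; it assumes user pods request GPUs via the nvidia.com/gpu resource), listing each pod's GPU request on a given node should line up with the nvidia-smi output:
$ oc get pods --all-namespaces --field-selector spec.nodeName=wrk-94,status.phase=Running \
    -o custom-columns='NAMESPACE:.metadata.namespace,POD:.metadata.name,GPUS:.spec.containers[*].resources.requests.nvidia\.com/gpu'
Pods showing <none> in the GPUS column are not requesting a GPU, so an all-<none> listing should correspond to an empty compute-apps list from nvidia-smi.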
How did this go?
@joachimweyl I've cordoned the following 4x A100-SXM4 nodes in the prod cluster:
$ oc get nodes -l 'nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB' | grep -i schedulingdisabled
wrk-101 Ready,SchedulingDisabled worker 173d v1.28.11+add48d0
wrk-94 Ready,SchedulingDisabled worker 189d v1.28.11+add48d0
wrk-95 Ready,SchedulingDisabled worker 189d v1.28.11+add48d0
wrk-96 Ready,SchedulingDisabled worker 189d v1.28.11+add48d0
All of these nodes appear to have some user workloads active; however, I've confirmed that wrk-101 and wrk-96 have no rhods-notebook pods running. These also have no active GPU workloads on them according to nvidia-smi.
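For reference, one way to spot-check that a node has no rhods-notebooks pods (my own query, not quoted from the thread) is a field selector scoped to that node:
$ oc get pods -n rhods-notebooks -o wide --field-selector spec.nodeName=wrk-101
An empty result here, together with an empty compute-apps list from nvidia-smi, is what indicates nothing user-facing is left on the node.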
In order to remove these from the prod cluster for ESI, we'll need to either wait for the remaining user workloads to finish on their own or drain the nodes, which would disrupt the user pods currently scheduled there.
The pods currently running on those 4x hosts:
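The exact drain invocation isn't shown in the thread, but given the "(server dry run)" markers in the output below it was presumably something along these lines:
$ oc adm drain wrk-94 wrk-95 wrk-96 wrk-101 --dry-run=server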
node/wrk-94 cordoned (server dry run)
error: unable to drain node "wrk-94" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-f8qn4, nvidia-gpu-operator/gpu-feature-discovery-wf4zs, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-58c9q, nvidia-gpu-operator/nvidia-dcgm-exporter-ptdxq, nvidia-gpu-operator/nvidia-dcgm-k99h5, nvidia-gpu-operator/nvidia-device-plugin-daemonset-gp6zx, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-4wg7l, nvidia-gpu-operator/nvidia-mig-manager-flr2w, nvidia-gpu-operator/nvidia-node-status-exporter-hfhvx, nvidia-gpu-operator/nvidia-operator-validator-4xzv5, openshift-cluster-node-tuning-operator/tuned-tkw8j, openshift-dns/dns-default-z7rq4, openshift-dns/node-resolver-gvxzq, openshift-image-registry/node-ca-zznc6, openshift-ingress-canary/ingress-canary-tspxc, openshift-logging/collector-hhlxt, openshift-machine-config-operator/machine-config-daemon-kfw4q, openshift-monitoring/node-exporter-h2dpm, openshift-multus/multus-68c4p, openshift-multus/multus-additional-cni-plugins-d942k, openshift-multus/network-metrics-daemon-6hkzk, openshift-network-diagnostics/network-check-target-x7lcj, openshift-nfd/nfd-worker-s57pg, openshift-nmstate/nmstate-handler-q7vvf, openshift-sdn/sdn-t7g7x, openshift-storage/csi-rbdplugin-xp6wm, seccomp-profile-installer/seccomp-profile-installer-m64kq, cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-user-workload-monitoring/thanos-ruler-user-workload-1, rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0, rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0, rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0, rhods-notebooks/jupyter-nb-hjgong-40bu-2eedu-0, rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0, rhods-notebooks/jupyter-nb-kimc-40bu-2eedu-0, rhods-notebooks/jupyter-nb-kthanasi-40bu-2eedu-0, rhods-notebooks/jupyter-nb-lferris1-40bu-2eedu-0, rhods-notebooks/jupyter-nb-lgp116-40bu-2eedu-0, rhods-notebooks/jupyter-nb-linhb-40bu-2eedu-0, rhods-notebooks/jupyter-nb-mikelel-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tsaij-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zhaoxm-40bu-2eedu-0, rustproject-3304e3/working-section-0], continuing command...
There are pending nodes to be drained:
wrk-94
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-f8qn4, nvidia-gpu-operator/gpu-feature-discovery-wf4zs, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-58c9q, nvidia-gpu-operator/nvidia-dcgm-exporter-ptdxq, nvidia-gpu-operator/nvidia-dcgm-k99h5, nvidia-gpu-operator/nvidia-device-plugin-daemonset-gp6zx, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-4wg7l, nvidia-gpu-operator/nvidia-mig-manager-flr2w, nvidia-gpu-operator/nvidia-node-status-exporter-hfhvx, nvidia-gpu-operator/nvidia-operator-validator-4xzv5, openshift-cluster-node-tuning-operator/tuned-tkw8j, openshift-dns/dns-default-z7rq4, openshift-dns/node-resolver-gvxzq, openshift-image-registry/node-ca-zznc6, openshift-ingress-canary/ingress-canary-tspxc, openshift-logging/collector-hhlxt, openshift-machine-config-operator/machine-config-daemon-kfw4q, openshift-monitoring/node-exporter-h2dpm, openshift-multus/multus-68c4p, openshift-multus/multus-additional-cni-plugins-d942k, openshift-multus/network-metrics-daemon-6hkzk, openshift-network-diagnostics/network-check-target-x7lcj, openshift-nfd/nfd-worker-s57pg, openshift-nmstate/nmstate-handler-q7vvf, openshift-sdn/sdn-t7g7x, openshift-storage/csi-rbdplugin-xp6wm, seccomp-profile-installer/seccomp-profile-installer-m64kq
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-user-workload-monitoring/thanos-ruler-user-workload-1, rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0, rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0, rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0, rhods-notebooks/jupyter-nb-hjgong-40bu-2eedu-0, rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0, rhods-notebooks/jupyter-nb-kimc-40bu-2eedu-0, rhods-notebooks/jupyter-nb-kthanasi-40bu-2eedu-0, rhods-notebooks/jupyter-nb-lferris1-40bu-2eedu-0, rhods-notebooks/jupyter-nb-lgp116-40bu-2eedu-0, rhods-notebooks/jupyter-nb-linhb-40bu-2eedu-0, rhods-notebooks/jupyter-nb-mikelel-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tsaij-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zhaoxm-40bu-2eedu-0, rustproject-3304e3/working-section-0
node/wrk-95 cordoned (server dry run)
error: unable to drain node "wrk-95" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-rc7wz, nvidia-gpu-operator/gpu-feature-discovery-72bpf, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lgtv8, nvidia-gpu-operator/nvidia-dcgm-824mf, nvidia-gpu-operator/nvidia-dcgm-exporter-d8xv9, nvidia-gpu-operator/nvidia-device-plugin-daemonset-k9vj5, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-m6dzg, nvidia-gpu-operator/nvidia-mig-manager-qz9lf, nvidia-gpu-operator/nvidia-node-status-exporter-nvrvj, nvidia-gpu-operator/nvidia-operator-validator-jqrwq, openshift-cluster-node-tuning-operator/tuned-rvjkb, openshift-dns/dns-default-s742x, openshift-dns/node-resolver-b4zb4, openshift-image-registry/node-ca-v76q5, openshift-ingress-canary/ingress-canary-k92pz, openshift-logging/collector-xrvxj, openshift-machine-config-operator/machine-config-daemon-d26xv, openshift-monitoring/node-exporter-7htxs, openshift-multus/multus-additional-cni-plugins-vsg9t, openshift-multus/multus-bwtlz, openshift-multus/network-metrics-daemon-fhvn4, openshift-network-diagnostics/network-check-target-26thj, openshift-nfd/nfd-worker-m2n5l, openshift-nmstate/nmstate-handler-dh4k4, openshift-sdn/sdn-mfhm8, openshift-storage/csi-rbdplugin-f74x9, seccomp-profile-installer/seccomp-profile-installer-sjzm9, cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-monitoring/prometheus-k8s-0, openshift-user-workload-monitoring/prometheus-user-workload-0, openshift-user-workload-monitoring/thanos-ruler-user-workload-0, rhods-notebooks/jupyter-nb-aalaman-40bu-2eedu-0, rhods-notebooks/jupyter-nb-aryasur-40bu-2eedu-0, rhods-notebooks/jupyter-nb-celiag-40bu-2eedu-0, rhods-notebooks/jupyter-nb-ekim6535-40bu-2eedu-0, rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0, rhods-notebooks/jupyter-nb-mmeng-40bu-2eedu-0, rhods-notebooks/jupyter-nb-nickt03-40bu-2eedu-0, rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0, rhods-notebooks/jupyter-nb-sliu10-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tessat-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0, rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0, sail-24887a/redis-558f589fbb-c77ws], continuing command...
There are pending nodes to be drained:
wrk-95
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-rc7wz, nvidia-gpu-operator/gpu-feature-discovery-72bpf, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lgtv8, nvidia-gpu-operator/nvidia-dcgm-824mf, nvidia-gpu-operator/nvidia-dcgm-exporter-d8xv9, nvidia-gpu-operator/nvidia-device-plugin-daemonset-k9vj5, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-m6dzg, nvidia-gpu-operator/nvidia-mig-manager-qz9lf, nvidia-gpu-operator/nvidia-node-status-exporter-nvrvj, nvidia-gpu-operator/nvidia-operator-validator-jqrwq, openshift-cluster-node-tuning-operator/tuned-rvjkb, openshift-dns/dns-default-s742x, openshift-dns/node-resolver-b4zb4, openshift-image-registry/node-ca-v76q5, openshift-ingress-canary/ingress-canary-k92pz, openshift-logging/collector-xrvxj, openshift-machine-config-operator/machine-config-daemon-d26xv, openshift-monitoring/node-exporter-7htxs, openshift-multus/multus-additional-cni-plugins-vsg9t, openshift-multus/multus-bwtlz, openshift-multus/network-metrics-daemon-fhvn4, openshift-network-diagnostics/network-check-target-26thj, openshift-nfd/nfd-worker-m2n5l, openshift-nmstate/nmstate-handler-dh4k4, openshift-sdn/sdn-mfhm8, openshift-storage/csi-rbdplugin-f74x9, seccomp-profile-installer/seccomp-profile-installer-sjzm9
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-monitoring/prometheus-k8s-0, openshift-user-workload-monitoring/prometheus-user-workload-0, openshift-user-workload-monitoring/thanos-ruler-user-workload-0, rhods-notebooks/jupyter-nb-aalaman-40bu-2eedu-0, rhods-notebooks/jupyter-nb-aryasur-40bu-2eedu-0, rhods-notebooks/jupyter-nb-celiag-40bu-2eedu-0, rhods-notebooks/jupyter-nb-ekim6535-40bu-2eedu-0, rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0, rhods-notebooks/jupyter-nb-mmeng-40bu-2eedu-0, rhods-notebooks/jupyter-nb-nickt03-40bu-2eedu-0, rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0, rhods-notebooks/jupyter-nb-sliu10-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tessat-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0, rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0, sail-24887a/redis-558f589fbb-c77ws
node/wrk-96 already cordoned (server dry run)
error: unable to drain node "wrk-96" due to error:[cannot delete Pods with local storage (use --delete-emptydir-data to override): ai-telemetry-cbca60/keycloak-0, koku-metrics-operator/curatordb-cluster-repo-host-0, ope-rhods-testing-1fef2f/danni-test-2-0, ope-rhods-testing-1fef2f/ja-ucsls-0, ope-rhods-testing-1fef2f/meera-utc-0, ope-rhods-testing-1fef2f/test-image-0, ope-rhods-testing-1fef2f/vnc-0, openshift-monitoring/alertmanager-main-0, cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-kvszc, nvidia-gpu-operator/gpu-feature-discovery-l5mvr, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lb6gb, nvidia-gpu-operator/nvidia-dcgm-exporter-jwqdd, nvidia-gpu-operator/nvidia-dcgm-ljf2v, nvidia-gpu-operator/nvidia-device-plugin-daemonset-nzfvq, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-jqrsg, nvidia-gpu-operator/nvidia-mig-manager-wfxlc, nvidia-gpu-operator/nvidia-node-status-exporter-x874h, nvidia-gpu-operator/nvidia-operator-validator-vg9p4, openshift-cluster-node-tuning-operator/tuned-4whll, openshift-dns/dns-default-fj764, openshift-dns/node-resolver-wnhwt, openshift-image-registry/node-ca-wnf8r, openshift-ingress-canary/ingress-canary-wg7k2, openshift-logging/collector-qsdgg, openshift-machine-config-operator/machine-config-daemon-klc7h, openshift-monitoring/node-exporter-wcrnf, openshift-multus/multus-6b9m4, openshift-multus/multus-additional-cni-plugins-wqd74, openshift-multus/network-metrics-daemon-47xxj, openshift-network-diagnostics/network-check-target-bknh8, openshift-nfd/nfd-worker-d6kr4, openshift-nmstate/nmstate-handler-fjclc, openshift-sdn/sdn-67j55, openshift-storage/csi-rbdplugin-fwp2g, seccomp-profile-installer/seccomp-profile-installer-d4qhj], continuing command...
There are pending nodes to be drained:
wrk-96
cannot delete Pods with local storage (use --delete-emptydir-data to override): ai-telemetry-cbca60/keycloak-0, koku-metrics-operator/curatordb-cluster-repo-host-0, ope-rhods-testing-1fef2f/danni-test-2-0, ope-rhods-testing-1fef2f/ja-ucsls-0, ope-rhods-testing-1fef2f/meera-utc-0, ope-rhods-testing-1fef2f/test-image-0, ope-rhods-testing-1fef2f/vnc-0, openshift-monitoring/alertmanager-main-0
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-kvszc, nvidia-gpu-operator/gpu-feature-discovery-l5mvr, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lb6gb, nvidia-gpu-operator/nvidia-dcgm-exporter-jwqdd, nvidia-gpu-operator/nvidia-dcgm-ljf2v, nvidia-gpu-operator/nvidia-device-plugin-daemonset-nzfvq, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-jqrsg, nvidia-gpu-operator/nvidia-mig-manager-wfxlc, nvidia-gpu-operator/nvidia-node-status-exporter-x874h, nvidia-gpu-operator/nvidia-operator-validator-vg9p4, openshift-cluster-node-tuning-operator/tuned-4whll, openshift-dns/dns-default-fj764, openshift-dns/node-resolver-wnhwt, openshift-image-registry/node-ca-wnf8r, openshift-ingress-canary/ingress-canary-wg7k2, openshift-logging/collector-qsdgg, openshift-machine-config-operator/machine-config-daemon-klc7h, openshift-monitoring/node-exporter-wcrnf, openshift-multus/multus-6b9m4, openshift-multus/multus-additional-cni-plugins-wqd74, openshift-multus/network-metrics-daemon-47xxj, openshift-network-diagnostics/network-check-target-bknh8, openshift-nfd/nfd-worker-d6kr4, openshift-nmstate/nmstate-handler-fjclc, openshift-sdn/sdn-67j55, openshift-storage/csi-rbdplugin-fwp2g, seccomp-profile-installer/seccomp-profile-installer-d4qhj
node/wrk-101 already cordoned (server dry run)
error: unable to drain node "wrk-101" due to error:[cannot delete Pods with local storage (use --delete-emptydir-data to override): curator-system/curatordb-cluster-00-lgb6-0, openshift-storage/object-backing-store-noobaa-pod-552cdf65, sail-24887a/comets-mongo-6c79b94448-fcg5h, cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-27n8k, nvidia-gpu-operator/gpu-feature-discovery-wktn8, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-p5b5w, nvidia-gpu-operator/nvidia-dcgm-exporter-dfhg4, nvidia-gpu-operator/nvidia-dcgm-tz7cm, nvidia-gpu-operator/nvidia-device-plugin-daemonset-mg45p, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-cgbwj, nvidia-gpu-operator/nvidia-mig-manager-8pcxq, nvidia-gpu-operator/nvidia-node-status-exporter-g26qq, nvidia-gpu-operator/nvidia-operator-validator-4bdzg, openshift-cluster-node-tuning-operator/tuned-qtk2z, openshift-dns/dns-default-4rpvn, openshift-dns/node-resolver-rrxvb, openshift-image-registry/node-ca-4n5jl, openshift-ingress-canary/ingress-canary-kqgxv, openshift-logging/collector-wzhxl, openshift-machine-config-operator/machine-config-daemon-clm97, openshift-monitoring/node-exporter-zn65c, openshift-multus/multus-additional-cni-plugins-x4wjf, openshift-multus/multus-npzsd, openshift-multus/network-metrics-daemon-mc46n, openshift-network-diagnostics/network-check-target-z2dhn, openshift-nfd/nfd-worker-smjbl, openshift-nmstate/nmstate-handler-zrfp9, openshift-sdn/sdn-fh6cf, openshift-storage/csi-rbdplugin-9lfjr, seccomp-profile-installer/seccomp-profile-installer-zn4wr], continuing command...
There are pending nodes to be drained:
wrk-101
cannot delete Pods with local storage (use --delete-emptydir-data to override): curator-system/curatordb-cluster-00-lgb6-0, openshift-storage/object-backing-store-noobaa-pod-552cdf65, sail-24887a/comets-mongo-6c79b94448-fcg5h
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-27n8k, nvidia-gpu-operator/gpu-feature-discovery-wktn8, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-p5b5w, nvidia-gpu-operator/nvidia-dcgm-exporter-dfhg4, nvidia-gpu-operator/nvidia-dcgm-tz7cm, nvidia-gpu-operator/nvidia-device-plugin-daemonset-mg45p, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-cgbwj, nvidia-gpu-operator/nvidia-mig-manager-8pcxq, nvidia-gpu-operator/nvidia-node-status-exporter-g26qq, nvidia-gpu-operator/nvidia-operator-validator-4bdzg, openshift-cluster-node-tuning-operator/tuned-qtk2z, openshift-dns/dns-default-4rpvn, openshift-dns/node-resolver-rrxvb, openshift-image-registry/node-ca-4n5jl, openshift-ingress-canary/ingress-canary-kqgxv, openshift-logging/collector-wzhxl, openshift-machine-config-operator/machine-config-daemon-clm97, openshift-monitoring/node-exporter-zn65c, openshift-multus/multus-additional-cni-plugins-x4wjf, openshift-multus/multus-npzsd, openshift-multus/network-metrics-daemon-mc46n, openshift-network-diagnostics/network-check-target-z2dhn, openshift-nfd/nfd-worker-smjbl, openshift-nmstate/nmstate-handler-zrfp9, openshift-sdn/sdn-fh6cf, openshift-storage/csi-rbdplugin-9lfjr, seccomp-profile-installer/seccomp-profile-installer-zn4wr
For now, RH has the hardware they need, so we can wait to see if the pods stop on their own soon. If the nodes have not emptied by the time we do our next maintenance, let's do it then. If we need to provide GPUs to RH sooner than the next maintenance, @Milstein, can you create (or copy-paste, if you already have one) a short communication for these users letting them know their projects will be restarted on new nodes?
@jtriley have any of the nodes cleared of pods?
Still looking at the list of pods, but it looks like 2/4 have cleared; the other 2 still have at least rhods-notebooks pods running:
node/wrk-94 already cordoned
error: unable to drain node "wrk-94" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-f8qn4, nvidia-gpu-operator/gpu-feature-discovery-wf4zs, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-58c9q, nvidia-gpu-operator/nvidia-dcgm-exporter-ptdxq, nvidia-gpu-operator/nvidia-dcgm-k99h5, nvidia-gpu-operator/nvidia-device-plugin-daemonset-gp6zx, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-4wg7l, nvidia-gpu-operator/nvidia-mig-manager-flr2w, nvidia-gpu-operator/nvidia-node-status-exporter-hfhvx, nvidia-gpu-operator/nvidia-operator-validator-4xzv5, openshift-cluster-node-tuning-operator/tuned-tkw8j, openshift-dns/dns-default-z7rq4, openshift-dns/node-resolver-gvxzq, openshift-image-registry/node-ca-zznc6, openshift-ingress-canary/ingress-canary-tspxc, openshift-logging/collector-hhlxt, openshift-machine-config-operator/machine-config-daemon-kfw4q, openshift-monitoring/node-exporter-h2dpm, openshift-multus/multus-68c4p, openshift-multus/multus-additional-cni-plugins-d942k, openshift-multus/network-metrics-daemon-6hkzk, openshift-network-diagnostics/network-check-target-x7lcj, openshift-nfd/nfd-worker-s57pg, openshift-nmstate/nmstate-handler-zmnp5, openshift-sdn/sdn-t7g7x, openshift-storage/csi-rbdplugin-xp6wm, seccomp-profile-installer/seccomp-profile-installer-m64kq, cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-user-workload-monitoring/thanos-ruler-user-workload-1, rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0, rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0, rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0, rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0], continuing command...
There are pending nodes to be drained:
wrk-94
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-f8qn4, nvidia-gpu-operator/gpu-feature-discovery-wf4zs, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-58c9q, nvidia-gpu-operator/nvidia-dcgm-exporter-ptdxq, nvidia-gpu-operator/nvidia-dcgm-k99h5, nvidia-gpu-operator/nvidia-device-plugin-daemonset-gp6zx, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-4wg7l, nvidia-gpu-operator/nvidia-mig-manager-flr2w, nvidia-gpu-operator/nvidia-node-status-exporter-hfhvx, nvidia-gpu-operator/nvidia-operator-validator-4xzv5, openshift-cluster-node-tuning-operator/tuned-tkw8j, openshift-dns/dns-default-z7rq4, openshift-dns/node-resolver-gvxzq, openshift-image-registry/node-ca-zznc6, openshift-ingress-canary/ingress-canary-tspxc, openshift-logging/collector-hhlxt, openshift-machine-config-operator/machine-config-daemon-kfw4q, openshift-monitoring/node-exporter-h2dpm, openshift-multus/multus-68c4p, openshift-multus/multus-additional-cni-plugins-d942k, openshift-multus/network-metrics-daemon-6hkzk, openshift-network-diagnostics/network-check-target-x7lcj, openshift-nfd/nfd-worker-s57pg, openshift-nmstate/nmstate-handler-zmnp5, openshift-sdn/sdn-t7g7x, openshift-storage/csi-rbdplugin-xp6wm, seccomp-profile-installer/seccomp-profile-installer-m64kq
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-user-workload-monitoring/thanos-ruler-user-workload-1, rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0, rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0, rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0, rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0
node/wrk-95 already cordoned
error: unable to drain node "wrk-95" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-rc7wz, nvidia-gpu-operator/gpu-feature-discovery-72bpf, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lgtv8, nvidia-gpu-operator/nvidia-dcgm-824mf, nvidia-gpu-operator/nvidia-dcgm-exporter-d8xv9, nvidia-gpu-operator/nvidia-device-plugin-daemonset-k9vj5, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-m6dzg, nvidia-gpu-operator/nvidia-mig-manager-qz9lf, nvidia-gpu-operator/nvidia-node-status-exporter-nvrvj, nvidia-gpu-operator/nvidia-operator-validator-jqrwq, openshift-cluster-node-tuning-operator/tuned-rvjkb, openshift-dns/dns-default-s742x, openshift-dns/node-resolver-b4zb4, openshift-image-registry/node-ca-v76q5, openshift-ingress-canary/ingress-canary-k92pz, openshift-logging/collector-xrvxj, openshift-machine-config-operator/machine-config-daemon-d26xv, openshift-monitoring/node-exporter-7htxs, openshift-multus/multus-additional-cni-plugins-vsg9t, openshift-multus/multus-bwtlz, openshift-multus/network-metrics-daemon-fhvn4, openshift-network-diagnostics/network-check-target-26thj, openshift-nfd/nfd-worker-m2n5l, openshift-nmstate/nmstate-handler-kxdtq, openshift-sdn/sdn-mfhm8, openshift-storage/csi-rbdplugin-f74x9, seccomp-profile-installer/seccomp-profile-installer-sjzm9, cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-monitoring/prometheus-k8s-0, openshift-user-workload-monitoring/prometheus-user-workload-0, openshift-user-workload-monitoring/thanos-ruler-user-workload-0, rhods-notebooks/jupyter-nb-ekim6535-40bu-2eedu-0, rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0, rhods-notebooks/jupyter-nb-nickt03-40bu-2eedu-0, rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0, rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0, sail-24887a/redis-558f589fbb-c77ws], continuing command...
There are pending nodes to be drained:
wrk-95
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-rc7wz, nvidia-gpu-operator/gpu-feature-discovery-72bpf, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lgtv8, nvidia-gpu-operator/nvidia-dcgm-824mf, nvidia-gpu-operator/nvidia-dcgm-exporter-d8xv9, nvidia-gpu-operator/nvidia-device-plugin-daemonset-k9vj5, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-m6dzg, nvidia-gpu-operator/nvidia-mig-manager-qz9lf, nvidia-gpu-operator/nvidia-node-status-exporter-nvrvj, nvidia-gpu-operator/nvidia-operator-validator-jqrwq, openshift-cluster-node-tuning-operator/tuned-rvjkb, openshift-dns/dns-default-s742x, openshift-dns/node-resolver-b4zb4, openshift-image-registry/node-ca-v76q5, openshift-ingress-canary/ingress-canary-k92pz, openshift-logging/collector-xrvxj, openshift-machine-config-operator/machine-config-daemon-d26xv, openshift-monitoring/node-exporter-7htxs, openshift-multus/multus-additional-cni-plugins-vsg9t, openshift-multus/multus-bwtlz, openshift-multus/network-metrics-daemon-fhvn4, openshift-network-diagnostics/network-check-target-26thj, openshift-nfd/nfd-worker-m2n5l, openshift-nmstate/nmstate-handler-kxdtq, openshift-sdn/sdn-mfhm8, openshift-storage/csi-rbdplugin-f74x9, seccomp-profile-installer/seccomp-profile-installer-sjzm9
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-monitoring/prometheus-k8s-0, openshift-user-workload-monitoring/prometheus-user-workload-0, openshift-user-workload-monitoring/thanos-ruler-user-workload-0, rhods-notebooks/jupyter-nb-ekim6535-40bu-2eedu-0, rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0, rhods-notebooks/jupyter-nb-nickt03-40bu-2eedu-0, rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0, rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0, sail-24887a/redis-558f589fbb-c77ws
node/wrk-96 already cordoned
error: unable to drain node "wrk-96" due to error:[cannot delete Pods with local storage (use --delete-emptydir-data to override): ai-telemetry-cbca60/keycloak-0, koku-metrics-operator/curatordb-cluster-repo-host-0, ope-rhods-testing-1fef2f/ja-ucsls-0, ope-rhods-testing-1fef2f/meera-utc-0, ope-rhods-testing-1fef2f/test-image-0, ope-rhods-testing-1fef2f/vnc-0, openshift-monitoring/alertmanager-main-0, cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-kvszc, nvidia-gpu-operator/gpu-feature-discovery-l5mvr, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lb6gb, nvidia-gpu-operator/nvidia-dcgm-exporter-jwqdd, nvidia-gpu-operator/nvidia-dcgm-ljf2v, nvidia-gpu-operator/nvidia-device-plugin-daemonset-nzfvq, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-jqrsg, nvidia-gpu-operator/nvidia-mig-manager-wfxlc, nvidia-gpu-operator/nvidia-node-status-exporter-x874h, nvidia-gpu-operator/nvidia-operator-validator-vg9p4, openshift-cluster-node-tuning-operator/tuned-4whll, openshift-dns/dns-default-fj764, openshift-dns/node-resolver-wnhwt, openshift-image-registry/node-ca-wnf8r, openshift-ingress-canary/ingress-canary-wg7k2, openshift-logging/collector-qsdgg, openshift-machine-config-operator/machine-config-daemon-klc7h, openshift-monitoring/node-exporter-wcrnf, openshift-multus/multus-6b9m4, openshift-multus/multus-additional-cni-plugins-wqd74, openshift-multus/network-metrics-daemon-47xxj, openshift-network-diagnostics/network-check-target-bknh8, openshift-nfd/nfd-worker-d6kr4, openshift-nmstate/nmstate-handler-x6mwm, openshift-sdn/sdn-67j55, openshift-storage/csi-rbdplugin-fwp2g, seccomp-profile-installer/seccomp-profile-installer-d4qhj], continuing command...
There are pending nodes to be drained:
wrk-96
cannot delete Pods with local storage (use --delete-emptydir-data to override): ai-telemetry-cbca60/keycloak-0, koku-metrics-operator/curatordb-cluster-repo-host-0, ope-rhods-testing-1fef2f/ja-ucsls-0, ope-rhods-testing-1fef2f/meera-utc-0, ope-rhods-testing-1fef2f/test-image-0, ope-rhods-testing-1fef2f/vnc-0, openshift-monitoring/alertmanager-main-0
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-kvszc, nvidia-gpu-operator/gpu-feature-discovery-l5mvr, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lb6gb, nvidia-gpu-operator/nvidia-dcgm-exporter-jwqdd, nvidia-gpu-operator/nvidia-dcgm-ljf2v, nvidia-gpu-operator/nvidia-device-plugin-daemonset-nzfvq, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-jqrsg, nvidia-gpu-operator/nvidia-mig-manager-wfxlc, nvidia-gpu-operator/nvidia-node-status-exporter-x874h, nvidia-gpu-operator/nvidia-operator-validator-vg9p4, openshift-cluster-node-tuning-operator/tuned-4whll, openshift-dns/dns-default-fj764, openshift-dns/node-resolver-wnhwt, openshift-image-registry/node-ca-wnf8r, openshift-ingress-canary/ingress-canary-wg7k2, openshift-logging/collector-qsdgg, openshift-machine-config-operator/machine-config-daemon-klc7h, openshift-monitoring/node-exporter-wcrnf, openshift-multus/multus-6b9m4, openshift-multus/multus-additional-cni-plugins-wqd74, openshift-multus/network-metrics-daemon-47xxj, openshift-network-diagnostics/network-check-target-bknh8, openshift-nfd/nfd-worker-d6kr4, openshift-nmstate/nmstate-handler-x6mwm, openshift-sdn/sdn-67j55, openshift-storage/csi-rbdplugin-fwp2g, seccomp-profile-installer/seccomp-profile-installer-d4qhj
node/wrk-101 already cordoned
error: unable to drain node "wrk-101" due to error:[cannot delete Pods with local storage (use --delete-emptydir-data to override): curator-system/curatordb-cluster-00-lgb6-0, openshift-storage/object-backing-store-noobaa-pod-552cdf65, sail-24887a/comets-mongo-6c79b94448-fcg5h, cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-27n8k, nvidia-gpu-operator/gpu-feature-discovery-wktn8, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-p5b5w, nvidia-gpu-operator/nvidia-dcgm-exporter-dfhg4, nvidia-gpu-operator/nvidia-dcgm-tz7cm, nvidia-gpu-operator/nvidia-device-plugin-daemonset-mg45p, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-cgbwj, nvidia-gpu-operator/nvidia-mig-manager-8pcxq, nvidia-gpu-operator/nvidia-node-status-exporter-g26qq, nvidia-gpu-operator/nvidia-operator-validator-4bdzg, openshift-cluster-node-tuning-operator/tuned-qtk2z, openshift-dns/dns-default-4rpvn, openshift-dns/node-resolver-rrxvb, openshift-image-registry/node-ca-4n5jl, openshift-ingress-canary/ingress-canary-kqgxv, openshift-logging/collector-wzhxl, openshift-machine-config-operator/machine-config-daemon-clm97, openshift-monitoring/node-exporter-zn65c, openshift-multus/multus-additional-cni-plugins-x4wjf, openshift-multus/multus-npzsd, openshift-multus/network-metrics-daemon-mc46n, openshift-network-diagnostics/network-check-target-z2dhn, openshift-nfd/nfd-worker-smjbl, openshift-nmstate/nmstate-handler-pf6qk, openshift-sdn/sdn-fh6cf, openshift-storage/csi-rbdplugin-9lfjr, seccomp-profile-installer/seccomp-profile-installer-zn4wr], continuing command...
There are pending nodes to be drained:
wrk-101
cannot delete Pods with local storage (use --delete-emptydir-data to override): curator-system/curatordb-cluster-00-lgb6-0, openshift-storage/object-backing-store-noobaa-pod-552cdf65, sail-24887a/comets-mongo-6c79b94448-fcg5h
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-27n8k, nvidia-gpu-operator/gpu-feature-discovery-wktn8, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-p5b5w, nvidia-gpu-operator/nvidia-dcgm-exporter-dfhg4, nvidia-gpu-operator/nvidia-dcgm-tz7cm, nvidia-gpu-operator/nvidia-device-plugin-daemonset-mg45p, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-cgbwj, nvidia-gpu-operator/nvidia-mig-manager-8pcxq, nvidia-gpu-operator/nvidia-node-status-exporter-g26qq, nvidia-gpu-operator/nvidia-operator-validator-4bdzg, openshift-cluster-node-tuning-operator/tuned-qtk2z, openshift-dns/dns-default-4rpvn, openshift-dns/node-resolver-rrxvb, openshift-image-registry/node-ca-4n5jl, openshift-ingress-canary/ingress-canary-kqgxv, openshift-logging/collector-wzhxl, openshift-machine-config-operator/machine-config-daemon-clm97, openshift-monitoring/node-exporter-zn65c, openshift-multus/multus-additional-cni-plugins-x4wjf, openshift-multus/multus-npzsd, openshift-multus/network-metrics-daemon-mc46n, openshift-network-diagnostics/network-check-target-z2dhn, openshift-nfd/nfd-worker-smjbl, openshift-nmstate/nmstate-handler-pf6qk, openshift-sdn/sdn-fh6cf, openshift-storage/csi-rbdplugin-9lfjr, seccomp-profile-installer/seccomp-profile-installer-zn4wr
@DanNiESh Is the reaper able to kill class pods on cordoned nodes? To add context:
rhods-notebooks jupyter-nb-ashikap-40bu-2eedu-0 2/2 Running 0 20d 10.131.20.214 wrk-94 <none> <none>
rhods-notebooks jupyter-nb-bmustafa-40bu-2eedu-0 2/2 Running 0 15d 10.131.20.222 wrk-94 <none> <none>
rhods-notebooks jupyter-nb-feli-40bu-2eedu-0 2/2 Running 0 26d 10.131.20.189 wrk-94 <none> <none>
rhods-notebooks jupyter-nb-jesswm-40bu-2eedu-0 2/2 Running 2 (15d ago) 27d 10.131.20.166 wrk-94 <none> <none>
Those pods/notebooks have all been running in the rhods-notebooks namespace for a while.
Looks like these notebooks belong to the ds210 class. The reaper won't shut down notebooks in the ds210 course, per the professor's requirement.
@joachimweyl @jtriley All of the notebook pods in the rhods-notebooks namespace on nodes wrk-94 and wrk-95 belong to the ds210 class, which the reaper is not set up to kill, per the professor's requirements. If we have to kill those pods, we need to notify the students and the professor of ds210 in advance.
@jtriley please move the 2 that were cleared to ESI.
As for the other 2 discussed above: @msdisme I assume we want to test the waters with the PI to see if these pods can be shut down? @Milstein please be prepared to reach out to kthanasi@bu.edu to discuss these pods.
Looking closer at the pods on these hosts, I found the following user workloads that might be of concern:
ai-telemetry-cbca60/keycloak-0
ope-rhods-testing-1fef2f/ja-ucsls-0
ope-rhods-testing-1fef2f/meera-utc-0
ope-rhods-testing-1fef2f/test-image-0
ope-rhods-testing-1fef2f/vnc-0
sail-24887a/comets-mongo-6c79b94448-fcg5h
I'm guessing the ope-rhods-testing-1fef2f pods could be killed, but I'm not sure about the pods in the ai-telemetry-cbca60 and sail-24887a namespaces. The keycloak-0 pod will likely come back cleanly via the single-sign-on operator.
I've asked in slack about these pods and am waiting for feedback on whether these are safe to terminate.
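For a per-node view of what remains once the DaemonSet and platform pods are filtered out, a rough filter like the following works (the namespace exclusions are my guess at what counts as infrastructure here, not an established list):
$ oc get pods --all-namespaces -o wide --field-selector spec.nodeName=wrk-96,status.phase=Running \
    | grep -Ev '^(openshift-|nvidia-gpu-operator|ipmi-exporter|seccomp-profile-installer)'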
Yes, ope-rhods-testing-1fef2f could be killed.
I've removed wrk-96 and wrk-101 and notified @hakasapl via slack that they are available to be moved over to ESI.
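The removal steps themselves aren't captured in this thread; a typical sequence for pulling an already-cordoned node looks roughly like this (an assumed sketch, not the exact commands used here):
# evict remaining pods, tolerating DaemonSets and emptyDir-backed pods
$ oc adm drain wrk-96 --ignore-daemonsets --delete-emptydir-data
$ oc adm drain wrk-101 --ignore-daemonsets --delete-emptydir-data
# remove the node objects so the hosts can be handed over to ESI
$ oc delete node wrk-96 wrk-101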
These are the user workloads still running on the other 2x A100s we're looking to remove from prod and move to ESI (wrk-94 and wrk-95):
rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0
rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0
rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0
rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0
rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0
rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0
rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0
rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0
rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0
sail-24887a/redis-558f589fbb-c77ws
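A quick way to re-check whether these have cleared (again a one-liner of my own, not from the thread):
$ for node in wrk-94 wrk-95; do \
    oc get pods --all-namespaces -o wide --field-selector spec.nodeName=$node,status.phase=Running | grep -E 'rhods-notebooks|sail-24887a'; \
  done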
I've removed wrk-94 and wrk-95 and notified @hakasapl via slack that they are available to be moved over to ESI.
Motivation
We need more and more BM GPU nodes for RH, and we are currently well below capacity for OpenShift usage. If we cordon nodes, they will be easy to move to either BM or OpenShift. By cordoning them we will force usage onto the other 4 nodes, and that way we have 4 nodes that are more flexible.
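For reference, cordoning and uncordoning are each a single command, which is what keeps this approach low-cost (standard oc usage, shown only as an illustration):
$ oc adm cordon wrk-94      # mark the node unschedulable; existing pods keep running
$ oc adm uncordon wrk-94    # return the node to the OpenShift scheduling pool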
Completion Criteria
6 nodes cordoned and ready to either move to BM or be uncordoned for OpenShift.
Description
Completion dates
Desired - 2024-09-05
Required - 2024-10-23