nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Cordon 4 A100SXM4 Nodes with the least current usage #702

Closed. joachimweyl closed this issue 1 week ago.

joachimweyl commented 2 months ago

Motivation

We need more and more bare-metal (BM) GPU nodes for RH, and we are currently well below capacity for OpenShift usage. If we cordon nodes, they will be easy either to move to BM or to uncordon for OpenShift. Cordoning them will force usage onto the other 4 nodes, so that we have 4 nodes that are more flexible.

Completion Criteria

6 nodes cordoned and ready either to be moved to BM or to be uncordoned for OpenShift

Description

Completion dates

Desired: 2024-09-05
Required: 2024-10-23

joachimweyl commented 2 months ago

@jtriley How much effort is this to kick off? Do you have a timeframe in which you plan to do it?

jtriley commented 2 months ago

@joachimweyl looking into this now. Currently scanning the nodes for GPU workloads using:

$ oc get nodes -o name -l 'nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB' | xargs -I {} -t oc debug --as=system:admin {} -- chroot /host bash -c 'crictl exec $(crictl ps --name nvidia-driver-ctr -q) nvidia-smi --query-compute-apps=pid --format csv,noheader'
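(For reference, a complementary check is to count the pods that request GPUs on each node, rather than live GPU processes. The sketch below is illustrative and not from the thread; only the node label is taken from the command above.)

# Count scheduled pods requesting nvidia.com/gpu on each A100-SXM4 node.
for node in $(oc get nodes -o name -l 'nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB' | cut -d/ -f2); do
  count=$(oc get pods -A --field-selector spec.nodeName="$node" -o json \
    | jq '[.items[] | select(any(.spec.containers[]; .resources.requests["nvidia.com/gpu"] != null))] | length')
  echo "$node: $count pod(s) requesting GPUs"
done
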
joachimweyl commented 2 months ago

How did this go?

jtriley commented 2 months ago

@joachimweyl I've cordoned the following 4x A100-SXM4 nodes in the prod cluster:

$ oc get nodes -l 'nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB' | grep -i schedulingdisabled
wrk-101   Ready,SchedulingDisabled   worker   173d   v1.28.11+add48d0
wrk-94    Ready,SchedulingDisabled   worker   189d   v1.28.11+add48d0
wrk-95    Ready,SchedulingDisabled   worker   189d   v1.28.11+add48d0
wrk-96    Ready,SchedulingDisabled   worker   189d   v1.28.11+add48d0
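(The cordon commands themselves aren't shown in the thread; presumably something along the lines of:)

$ oc adm cordon wrk-94   # repeated for wrk-95, wrk-96, wrk-101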

All of these nodes appear to have some user workloads active; however, I've confirmed that wrk-101 and wrk-96 have no rhods-notebooks pods running. They also have no active GPU workloads according to nvidia-smi.

In order to remove these from the prod cluster for ESI, we'll need to do one of the following:

  1. Wait for user workload pods to terminate on their own
  2. Notify folks and terminate the pods manually
  3. Drain these nodes during a future maintenance window when all worker nodes would have to be rebooted anyway (e.g. a udev rule update, an upgrade, etc.); see the sketch just below
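(For option 3, the eventual drain would presumably use the override flags called out in the dry-run errors below; the exact invocation is an illustrative sketch:)

$ oc adm drain wrk-94 --ignore-daemonsets --delete-emptydir-data
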
jtriley commented 2 months ago

The pods running on those 4x hosts currently:

node/wrk-94 cordoned (server dry run)
error: unable to drain node "wrk-94" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-f8qn4, nvidia-gpu-operator/gpu-feature-discovery-wf4zs, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-58c9q, nvidia-gpu-operator/nvidia-dcgm-exporter-ptdxq, nvidia-gpu-operator/nvidia-dcgm-k99h5, nvidia-gpu-operator/nvidia-device-plugin-daemonset-gp6zx, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-4wg7l, nvidia-gpu-operator/nvidia-mig-manager-flr2w, nvidia-gpu-operator/nvidia-node-status-exporter-hfhvx, nvidia-gpu-operator/nvidia-operator-validator-4xzv5, openshift-cluster-node-tuning-operator/tuned-tkw8j, openshift-dns/dns-default-z7rq4, openshift-dns/node-resolver-gvxzq, openshift-image-registry/node-ca-zznc6, openshift-ingress-canary/ingress-canary-tspxc, openshift-logging/collector-hhlxt, openshift-machine-config-operator/machine-config-daemon-kfw4q, openshift-monitoring/node-exporter-h2dpm, openshift-multus/multus-68c4p, openshift-multus/multus-additional-cni-plugins-d942k, openshift-multus/network-metrics-daemon-6hkzk, openshift-network-diagnostics/network-check-target-x7lcj, openshift-nfd/nfd-worker-s57pg, openshift-nmstate/nmstate-handler-q7vvf, openshift-sdn/sdn-t7g7x, openshift-storage/csi-rbdplugin-xp6wm, seccomp-profile-installer/seccomp-profile-installer-m64kq, cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-user-workload-monitoring/thanos-ruler-user-workload-1, rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0, rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0, rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0, rhods-notebooks/jupyter-nb-hjgong-40bu-2eedu-0, rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0, rhods-notebooks/jupyter-nb-kimc-40bu-2eedu-0, rhods-notebooks/jupyter-nb-kthanasi-40bu-2eedu-0, rhods-notebooks/jupyter-nb-lferris1-40bu-2eedu-0, rhods-notebooks/jupyter-nb-lgp116-40bu-2eedu-0, rhods-notebooks/jupyter-nb-linhb-40bu-2eedu-0, rhods-notebooks/jupyter-nb-mikelel-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tsaij-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zhaoxm-40bu-2eedu-0, rustproject-3304e3/working-section-0], continuing command...
There are pending nodes to be drained:
 wrk-94
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-f8qn4, nvidia-gpu-operator/gpu-feature-discovery-wf4zs, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-58c9q, nvidia-gpu-operator/nvidia-dcgm-exporter-ptdxq, nvidia-gpu-operator/nvidia-dcgm-k99h5, nvidia-gpu-operator/nvidia-device-plugin-daemonset-gp6zx, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-4wg7l, nvidia-gpu-operator/nvidia-mig-manager-flr2w, nvidia-gpu-operator/nvidia-node-status-exporter-hfhvx, nvidia-gpu-operator/nvidia-operator-validator-4xzv5, openshift-cluster-node-tuning-operator/tuned-tkw8j, openshift-dns/dns-default-z7rq4, openshift-dns/node-resolver-gvxzq, openshift-image-registry/node-ca-zznc6, openshift-ingress-canary/ingress-canary-tspxc, openshift-logging/collector-hhlxt, openshift-machine-config-operator/machine-config-daemon-kfw4q, openshift-monitoring/node-exporter-h2dpm, openshift-multus/multus-68c4p, openshift-multus/multus-additional-cni-plugins-d942k, openshift-multus/network-metrics-daemon-6hkzk, openshift-network-diagnostics/network-check-target-x7lcj, openshift-nfd/nfd-worker-s57pg, openshift-nmstate/nmstate-handler-q7vvf, openshift-sdn/sdn-t7g7x, openshift-storage/csi-rbdplugin-xp6wm, seccomp-profile-installer/seccomp-profile-installer-m64kq
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-user-workload-monitoring/thanos-ruler-user-workload-1, rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0, rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0, rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0, rhods-notebooks/jupyter-nb-hjgong-40bu-2eedu-0, rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0, rhods-notebooks/jupyter-nb-kimc-40bu-2eedu-0, rhods-notebooks/jupyter-nb-kthanasi-40bu-2eedu-0, rhods-notebooks/jupyter-nb-lferris1-40bu-2eedu-0, rhods-notebooks/jupyter-nb-lgp116-40bu-2eedu-0, rhods-notebooks/jupyter-nb-linhb-40bu-2eedu-0, rhods-notebooks/jupyter-nb-mikelel-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tsaij-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zhaoxm-40bu-2eedu-0, rustproject-3304e3/working-section-0

node/wrk-95 cordoned (server dry run)
error: unable to drain node "wrk-95" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-rc7wz, nvidia-gpu-operator/gpu-feature-discovery-72bpf, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lgtv8, nvidia-gpu-operator/nvidia-dcgm-824mf, nvidia-gpu-operator/nvidia-dcgm-exporter-d8xv9, nvidia-gpu-operator/nvidia-device-plugin-daemonset-k9vj5, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-m6dzg, nvidia-gpu-operator/nvidia-mig-manager-qz9lf, nvidia-gpu-operator/nvidia-node-status-exporter-nvrvj, nvidia-gpu-operator/nvidia-operator-validator-jqrwq, openshift-cluster-node-tuning-operator/tuned-rvjkb, openshift-dns/dns-default-s742x, openshift-dns/node-resolver-b4zb4, openshift-image-registry/node-ca-v76q5, openshift-ingress-canary/ingress-canary-k92pz, openshift-logging/collector-xrvxj, openshift-machine-config-operator/machine-config-daemon-d26xv, openshift-monitoring/node-exporter-7htxs, openshift-multus/multus-additional-cni-plugins-vsg9t, openshift-multus/multus-bwtlz, openshift-multus/network-metrics-daemon-fhvn4, openshift-network-diagnostics/network-check-target-26thj, openshift-nfd/nfd-worker-m2n5l, openshift-nmstate/nmstate-handler-dh4k4, openshift-sdn/sdn-mfhm8, openshift-storage/csi-rbdplugin-f74x9, seccomp-profile-installer/seccomp-profile-installer-sjzm9, cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-monitoring/prometheus-k8s-0, openshift-user-workload-monitoring/prometheus-user-workload-0, openshift-user-workload-monitoring/thanos-ruler-user-workload-0, rhods-notebooks/jupyter-nb-aalaman-40bu-2eedu-0, rhods-notebooks/jupyter-nb-aryasur-40bu-2eedu-0, rhods-notebooks/jupyter-nb-celiag-40bu-2eedu-0, rhods-notebooks/jupyter-nb-ekim6535-40bu-2eedu-0, rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0, rhods-notebooks/jupyter-nb-mmeng-40bu-2eedu-0, rhods-notebooks/jupyter-nb-nickt03-40bu-2eedu-0, rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0, rhods-notebooks/jupyter-nb-sliu10-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tessat-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0, rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0, sail-24887a/redis-558f589fbb-c77ws], continuing command...
There are pending nodes to be drained:
 wrk-95
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-rc7wz, nvidia-gpu-operator/gpu-feature-discovery-72bpf, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lgtv8, nvidia-gpu-operator/nvidia-dcgm-824mf, nvidia-gpu-operator/nvidia-dcgm-exporter-d8xv9, nvidia-gpu-operator/nvidia-device-plugin-daemonset-k9vj5, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-m6dzg, nvidia-gpu-operator/nvidia-mig-manager-qz9lf, nvidia-gpu-operator/nvidia-node-status-exporter-nvrvj, nvidia-gpu-operator/nvidia-operator-validator-jqrwq, openshift-cluster-node-tuning-operator/tuned-rvjkb, openshift-dns/dns-default-s742x, openshift-dns/node-resolver-b4zb4, openshift-image-registry/node-ca-v76q5, openshift-ingress-canary/ingress-canary-k92pz, openshift-logging/collector-xrvxj, openshift-machine-config-operator/machine-config-daemon-d26xv, openshift-monitoring/node-exporter-7htxs, openshift-multus/multus-additional-cni-plugins-vsg9t, openshift-multus/multus-bwtlz, openshift-multus/network-metrics-daemon-fhvn4, openshift-network-diagnostics/network-check-target-26thj, openshift-nfd/nfd-worker-m2n5l, openshift-nmstate/nmstate-handler-dh4k4, openshift-sdn/sdn-mfhm8, openshift-storage/csi-rbdplugin-f74x9, seccomp-profile-installer/seccomp-profile-installer-sjzm9
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-monitoring/prometheus-k8s-0, openshift-user-workload-monitoring/prometheus-user-workload-0, openshift-user-workload-monitoring/thanos-ruler-user-workload-0, rhods-notebooks/jupyter-nb-aalaman-40bu-2eedu-0, rhods-notebooks/jupyter-nb-aryasur-40bu-2eedu-0, rhods-notebooks/jupyter-nb-celiag-40bu-2eedu-0, rhods-notebooks/jupyter-nb-ekim6535-40bu-2eedu-0, rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0, rhods-notebooks/jupyter-nb-mmeng-40bu-2eedu-0, rhods-notebooks/jupyter-nb-nickt03-40bu-2eedu-0, rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0, rhods-notebooks/jupyter-nb-sliu10-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tessat-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0, rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0, sail-24887a/redis-558f589fbb-c77ws

node/wrk-96 already cordoned (server dry run)
error: unable to drain node "wrk-96" due to error:[cannot delete Pods with local storage (use --delete-emptydir-data to override): ai-telemetry-cbca60/keycloak-0, koku-metrics-operator/curatordb-cluster-repo-host-0, ope-rhods-testing-1fef2f/danni-test-2-0, ope-rhods-testing-1fef2f/ja-ucsls-0, ope-rhods-testing-1fef2f/meera-utc-0, ope-rhods-testing-1fef2f/test-image-0, ope-rhods-testing-1fef2f/vnc-0, openshift-monitoring/alertmanager-main-0, cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-kvszc, nvidia-gpu-operator/gpu-feature-discovery-l5mvr, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lb6gb, nvidia-gpu-operator/nvidia-dcgm-exporter-jwqdd, nvidia-gpu-operator/nvidia-dcgm-ljf2v, nvidia-gpu-operator/nvidia-device-plugin-daemonset-nzfvq, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-jqrsg, nvidia-gpu-operator/nvidia-mig-manager-wfxlc, nvidia-gpu-operator/nvidia-node-status-exporter-x874h, nvidia-gpu-operator/nvidia-operator-validator-vg9p4, openshift-cluster-node-tuning-operator/tuned-4whll, openshift-dns/dns-default-fj764, openshift-dns/node-resolver-wnhwt, openshift-image-registry/node-ca-wnf8r, openshift-ingress-canary/ingress-canary-wg7k2, openshift-logging/collector-qsdgg, openshift-machine-config-operator/machine-config-daemon-klc7h, openshift-monitoring/node-exporter-wcrnf, openshift-multus/multus-6b9m4, openshift-multus/multus-additional-cni-plugins-wqd74, openshift-multus/network-metrics-daemon-47xxj, openshift-network-diagnostics/network-check-target-bknh8, openshift-nfd/nfd-worker-d6kr4, openshift-nmstate/nmstate-handler-fjclc, openshift-sdn/sdn-67j55, openshift-storage/csi-rbdplugin-fwp2g, seccomp-profile-installer/seccomp-profile-installer-d4qhj], continuing command...
There are pending nodes to be drained:
 wrk-96
cannot delete Pods with local storage (use --delete-emptydir-data to override): ai-telemetry-cbca60/keycloak-0, koku-metrics-operator/curatordb-cluster-repo-host-0, ope-rhods-testing-1fef2f/danni-test-2-0, ope-rhods-testing-1fef2f/ja-ucsls-0, ope-rhods-testing-1fef2f/meera-utc-0, ope-rhods-testing-1fef2f/test-image-0, ope-rhods-testing-1fef2f/vnc-0, openshift-monitoring/alertmanager-main-0
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-kvszc, nvidia-gpu-operator/gpu-feature-discovery-l5mvr, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lb6gb, nvidia-gpu-operator/nvidia-dcgm-exporter-jwqdd, nvidia-gpu-operator/nvidia-dcgm-ljf2v, nvidia-gpu-operator/nvidia-device-plugin-daemonset-nzfvq, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-jqrsg, nvidia-gpu-operator/nvidia-mig-manager-wfxlc, nvidia-gpu-operator/nvidia-node-status-exporter-x874h, nvidia-gpu-operator/nvidia-operator-validator-vg9p4, openshift-cluster-node-tuning-operator/tuned-4whll, openshift-dns/dns-default-fj764, openshift-dns/node-resolver-wnhwt, openshift-image-registry/node-ca-wnf8r, openshift-ingress-canary/ingress-canary-wg7k2, openshift-logging/collector-qsdgg, openshift-machine-config-operator/machine-config-daemon-klc7h, openshift-monitoring/node-exporter-wcrnf, openshift-multus/multus-6b9m4, openshift-multus/multus-additional-cni-plugins-wqd74, openshift-multus/network-metrics-daemon-47xxj, openshift-network-diagnostics/network-check-target-bknh8, openshift-nfd/nfd-worker-d6kr4, openshift-nmstate/nmstate-handler-fjclc, openshift-sdn/sdn-67j55, openshift-storage/csi-rbdplugin-fwp2g, seccomp-profile-installer/seccomp-profile-installer-d4qhj

node/wrk-101 already cordoned (server dry run)
error: unable to drain node "wrk-101" due to error:[cannot delete Pods with local storage (use --delete-emptydir-data to override): curator-system/curatordb-cluster-00-lgb6-0, openshift-storage/object-backing-store-noobaa-pod-552cdf65, sail-24887a/comets-mongo-6c79b94448-fcg5h, cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-27n8k, nvidia-gpu-operator/gpu-feature-discovery-wktn8, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-p5b5w, nvidia-gpu-operator/nvidia-dcgm-exporter-dfhg4, nvidia-gpu-operator/nvidia-dcgm-tz7cm, nvidia-gpu-operator/nvidia-device-plugin-daemonset-mg45p, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-cgbwj, nvidia-gpu-operator/nvidia-mig-manager-8pcxq, nvidia-gpu-operator/nvidia-node-status-exporter-g26qq, nvidia-gpu-operator/nvidia-operator-validator-4bdzg, openshift-cluster-node-tuning-operator/tuned-qtk2z, openshift-dns/dns-default-4rpvn, openshift-dns/node-resolver-rrxvb, openshift-image-registry/node-ca-4n5jl, openshift-ingress-canary/ingress-canary-kqgxv, openshift-logging/collector-wzhxl, openshift-machine-config-operator/machine-config-daemon-clm97, openshift-monitoring/node-exporter-zn65c, openshift-multus/multus-additional-cni-plugins-x4wjf, openshift-multus/multus-npzsd, openshift-multus/network-metrics-daemon-mc46n, openshift-network-diagnostics/network-check-target-z2dhn, openshift-nfd/nfd-worker-smjbl, openshift-nmstate/nmstate-handler-zrfp9, openshift-sdn/sdn-fh6cf, openshift-storage/csi-rbdplugin-9lfjr, seccomp-profile-installer/seccomp-profile-installer-zn4wr], continuing command...
There are pending nodes to be drained:
 wrk-101
cannot delete Pods with local storage (use --delete-emptydir-data to override): curator-system/curatordb-cluster-00-lgb6-0, openshift-storage/object-backing-store-noobaa-pod-552cdf65, sail-24887a/comets-mongo-6c79b94448-fcg5h
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-27n8k, nvidia-gpu-operator/gpu-feature-discovery-wktn8, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-p5b5w, nvidia-gpu-operator/nvidia-dcgm-exporter-dfhg4, nvidia-gpu-operator/nvidia-dcgm-tz7cm, nvidia-gpu-operator/nvidia-device-plugin-daemonset-mg45p, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-cgbwj, nvidia-gpu-operator/nvidia-mig-manager-8pcxq, nvidia-gpu-operator/nvidia-node-status-exporter-g26qq, nvidia-gpu-operator/nvidia-operator-validator-4bdzg, openshift-cluster-node-tuning-operator/tuned-qtk2z, openshift-dns/dns-default-4rpvn, openshift-dns/node-resolver-rrxvb, openshift-image-registry/node-ca-4n5jl, openshift-ingress-canary/ingress-canary-kqgxv, openshift-logging/collector-wzhxl, openshift-machine-config-operator/machine-config-daemon-clm97, openshift-monitoring/node-exporter-zn65c, openshift-multus/multus-additional-cni-plugins-x4wjf, openshift-multus/multus-npzsd, openshift-multus/network-metrics-daemon-mc46n, openshift-network-diagnostics/network-check-target-z2dhn, openshift-nfd/nfd-worker-smjbl, openshift-nmstate/nmstate-handler-zrfp9, openshift-sdn/sdn-fh6cf, openshift-storage/csi-rbdplugin-9lfjr, seccomp-profile-installer/seccomp-profile-installer-zn4wr
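(The per-node output above is from a server-side dry run of a drain, presumably along the lines of:)

$ oc adm drain wrk-94 --dry-run=server   # and likewise for wrk-95, wrk-96, wrk-101
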
joachimweyl commented 2 months ago

For now, RH has the hardware they need, so we can wait and see if the pods stop on their own soon. If the nodes have not emptied by the time we do our next maintenance, let's drain them then. If we need to provide GPUs to RH sooner than the next maintenance, @Milstein, can you create (or copy-paste, if you already have one) a short communication letting these users know that their projects will be restarted on new nodes?

joachimweyl commented 1 month ago

@jtriley Have any of the nodes cleared of pods?

jtriley commented 1 month ago

Still looking at the list of pods, but it looks like 2 of the 4 have cleared; the other 2 at least still have rhods notebooks running:

node/wrk-94 already cordoned
error: unable to drain node "wrk-94" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-f8qn4, nvidia-gpu-operator/gpu-feature-discovery-wf4zs, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-58c9q, nvidia-gpu-operator/nvidia-dcgm-exporter-ptdxq, nvidia-gpu-operator/nvidia-dcgm-k99h5, nvidia-gpu-operator/nvidia-device-plugin-daemonset-gp6zx, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-4wg7l, nvidia-gpu-operator/nvidia-mig-manager-flr2w, nvidia-gpu-operator/nvidia-node-status-exporter-hfhvx, nvidia-gpu-operator/nvidia-operator-validator-4xzv5, openshift-cluster-node-tuning-operator/tuned-tkw8j, openshift-dns/dns-default-z7rq4, openshift-dns/node-resolver-gvxzq, openshift-image-registry/node-ca-zznc6, openshift-ingress-canary/ingress-canary-tspxc, openshift-logging/collector-hhlxt, openshift-machine-config-operator/machine-config-daemon-kfw4q, openshift-monitoring/node-exporter-h2dpm, openshift-multus/multus-68c4p, openshift-multus/multus-additional-cni-plugins-d942k, openshift-multus/network-metrics-daemon-6hkzk, openshift-network-diagnostics/network-check-target-x7lcj, openshift-nfd/nfd-worker-s57pg, openshift-nmstate/nmstate-handler-zmnp5, openshift-sdn/sdn-t7g7x, openshift-storage/csi-rbdplugin-xp6wm, seccomp-profile-installer/seccomp-profile-installer-m64kq, cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-user-workload-monitoring/thanos-ruler-user-workload-1, rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0, rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0, rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0, rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0], continuing command...
There are pending nodes to be drained:
 wrk-94
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-f8qn4, nvidia-gpu-operator/gpu-feature-discovery-wf4zs, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-58c9q, nvidia-gpu-operator/nvidia-dcgm-exporter-ptdxq, nvidia-gpu-operator/nvidia-dcgm-k99h5, nvidia-gpu-operator/nvidia-device-plugin-daemonset-gp6zx, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-4wg7l, nvidia-gpu-operator/nvidia-mig-manager-flr2w, nvidia-gpu-operator/nvidia-node-status-exporter-hfhvx, nvidia-gpu-operator/nvidia-operator-validator-4xzv5, openshift-cluster-node-tuning-operator/tuned-tkw8j, openshift-dns/dns-default-z7rq4, openshift-dns/node-resolver-gvxzq, openshift-image-registry/node-ca-zznc6, openshift-ingress-canary/ingress-canary-tspxc, openshift-logging/collector-hhlxt, openshift-machine-config-operator/machine-config-daemon-kfw4q, openshift-monitoring/node-exporter-h2dpm, openshift-multus/multus-68c4p, openshift-multus/multus-additional-cni-plugins-d942k, openshift-multus/network-metrics-daemon-6hkzk, openshift-network-diagnostics/network-check-target-x7lcj, openshift-nfd/nfd-worker-s57pg, openshift-nmstate/nmstate-handler-zmnp5, openshift-sdn/sdn-t7g7x, openshift-storage/csi-rbdplugin-xp6wm, seccomp-profile-installer/seccomp-profile-installer-m64kq
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-user-workload-monitoring/thanos-ruler-user-workload-1, rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0, rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0, rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0, rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0

node/wrk-95 already cordoned
error: unable to drain node "wrk-95" due to error:[cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-rc7wz, nvidia-gpu-operator/gpu-feature-discovery-72bpf, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lgtv8, nvidia-gpu-operator/nvidia-dcgm-824mf, nvidia-gpu-operator/nvidia-dcgm-exporter-d8xv9, nvidia-gpu-operator/nvidia-device-plugin-daemonset-k9vj5, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-m6dzg, nvidia-gpu-operator/nvidia-mig-manager-qz9lf, nvidia-gpu-operator/nvidia-node-status-exporter-nvrvj, nvidia-gpu-operator/nvidia-operator-validator-jqrwq, openshift-cluster-node-tuning-operator/tuned-rvjkb, openshift-dns/dns-default-s742x, openshift-dns/node-resolver-b4zb4, openshift-image-registry/node-ca-v76q5, openshift-ingress-canary/ingress-canary-k92pz, openshift-logging/collector-xrvxj, openshift-machine-config-operator/machine-config-daemon-d26xv, openshift-monitoring/node-exporter-7htxs, openshift-multus/multus-additional-cni-plugins-vsg9t, openshift-multus/multus-bwtlz, openshift-multus/network-metrics-daemon-fhvn4, openshift-network-diagnostics/network-check-target-26thj, openshift-nfd/nfd-worker-m2n5l, openshift-nmstate/nmstate-handler-kxdtq, openshift-sdn/sdn-mfhm8, openshift-storage/csi-rbdplugin-f74x9, seccomp-profile-installer/seccomp-profile-installer-sjzm9, cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-monitoring/prometheus-k8s-0, openshift-user-workload-monitoring/prometheus-user-workload-0, openshift-user-workload-monitoring/thanos-ruler-user-workload-0, rhods-notebooks/jupyter-nb-ekim6535-40bu-2eedu-0, rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0, rhods-notebooks/jupyter-nb-nickt03-40bu-2eedu-0, rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0, rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0, sail-24887a/redis-558f589fbb-c77ws], continuing command...
There are pending nodes to be drained:
 wrk-95
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-rc7wz, nvidia-gpu-operator/gpu-feature-discovery-72bpf, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lgtv8, nvidia-gpu-operator/nvidia-dcgm-824mf, nvidia-gpu-operator/nvidia-dcgm-exporter-d8xv9, nvidia-gpu-operator/nvidia-device-plugin-daemonset-k9vj5, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-m6dzg, nvidia-gpu-operator/nvidia-mig-manager-qz9lf, nvidia-gpu-operator/nvidia-node-status-exporter-nvrvj, nvidia-gpu-operator/nvidia-operator-validator-jqrwq, openshift-cluster-node-tuning-operator/tuned-rvjkb, openshift-dns/dns-default-s742x, openshift-dns/node-resolver-b4zb4, openshift-image-registry/node-ca-v76q5, openshift-ingress-canary/ingress-canary-k92pz, openshift-logging/collector-xrvxj, openshift-machine-config-operator/machine-config-daemon-d26xv, openshift-monitoring/node-exporter-7htxs, openshift-multus/multus-additional-cni-plugins-vsg9t, openshift-multus/multus-bwtlz, openshift-multus/network-metrics-daemon-fhvn4, openshift-network-diagnostics/network-check-target-26thj, openshift-nfd/nfd-worker-m2n5l, openshift-nmstate/nmstate-handler-kxdtq, openshift-sdn/sdn-mfhm8, openshift-storage/csi-rbdplugin-f74x9, seccomp-profile-installer/seccomp-profile-installer-sjzm9
cannot delete Pods with local storage (use --delete-emptydir-data to override): openshift-monitoring/prometheus-k8s-0, openshift-user-workload-monitoring/prometheus-user-workload-0, openshift-user-workload-monitoring/thanos-ruler-user-workload-0, rhods-notebooks/jupyter-nb-ekim6535-40bu-2eedu-0, rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0, rhods-notebooks/jupyter-nb-nickt03-40bu-2eedu-0, rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0, rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0, rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0, rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0, sail-24887a/redis-558f589fbb-c77ws

node/wrk-96 already cordoned
error: unable to drain node "wrk-96" due to error:[cannot delete Pods with local storage (use --delete-emptydir-data to override): ai-telemetry-cbca60/keycloak-0, koku-metrics-operator/curatordb-cluster-repo-host-0, ope-rhods-testing-1fef2f/ja-ucsls-0, ope-rhods-testing-1fef2f/meera-utc-0, ope-rhods-testing-1fef2f/test-image-0, ope-rhods-testing-1fef2f/vnc-0, openshift-monitoring/alertmanager-main-0, cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-kvszc, nvidia-gpu-operator/gpu-feature-discovery-l5mvr, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lb6gb, nvidia-gpu-operator/nvidia-dcgm-exporter-jwqdd, nvidia-gpu-operator/nvidia-dcgm-ljf2v, nvidia-gpu-operator/nvidia-device-plugin-daemonset-nzfvq, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-jqrsg, nvidia-gpu-operator/nvidia-mig-manager-wfxlc, nvidia-gpu-operator/nvidia-node-status-exporter-x874h, nvidia-gpu-operator/nvidia-operator-validator-vg9p4, openshift-cluster-node-tuning-operator/tuned-4whll, openshift-dns/dns-default-fj764, openshift-dns/node-resolver-wnhwt, openshift-image-registry/node-ca-wnf8r, openshift-ingress-canary/ingress-canary-wg7k2, openshift-logging/collector-qsdgg, openshift-machine-config-operator/machine-config-daemon-klc7h, openshift-monitoring/node-exporter-wcrnf, openshift-multus/multus-6b9m4, openshift-multus/multus-additional-cni-plugins-wqd74, openshift-multus/network-metrics-daemon-47xxj, openshift-network-diagnostics/network-check-target-bknh8, openshift-nfd/nfd-worker-d6kr4, openshift-nmstate/nmstate-handler-x6mwm, openshift-sdn/sdn-67j55, openshift-storage/csi-rbdplugin-fwp2g, seccomp-profile-installer/seccomp-profile-installer-d4qhj], continuing command...
There are pending nodes to be drained:
 wrk-96
cannot delete Pods with local storage (use --delete-emptydir-data to override): ai-telemetry-cbca60/keycloak-0, koku-metrics-operator/curatordb-cluster-repo-host-0, ope-rhods-testing-1fef2f/ja-ucsls-0, ope-rhods-testing-1fef2f/meera-utc-0, ope-rhods-testing-1fef2f/test-image-0, ope-rhods-testing-1fef2f/vnc-0, openshift-monitoring/alertmanager-main-0
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-kvszc, nvidia-gpu-operator/gpu-feature-discovery-l5mvr, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-lb6gb, nvidia-gpu-operator/nvidia-dcgm-exporter-jwqdd, nvidia-gpu-operator/nvidia-dcgm-ljf2v, nvidia-gpu-operator/nvidia-device-plugin-daemonset-nzfvq, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-jqrsg, nvidia-gpu-operator/nvidia-mig-manager-wfxlc, nvidia-gpu-operator/nvidia-node-status-exporter-x874h, nvidia-gpu-operator/nvidia-operator-validator-vg9p4, openshift-cluster-node-tuning-operator/tuned-4whll, openshift-dns/dns-default-fj764, openshift-dns/node-resolver-wnhwt, openshift-image-registry/node-ca-wnf8r, openshift-ingress-canary/ingress-canary-wg7k2, openshift-logging/collector-qsdgg, openshift-machine-config-operator/machine-config-daemon-klc7h, openshift-monitoring/node-exporter-wcrnf, openshift-multus/multus-6b9m4, openshift-multus/multus-additional-cni-plugins-wqd74, openshift-multus/network-metrics-daemon-47xxj, openshift-network-diagnostics/network-check-target-bknh8, openshift-nfd/nfd-worker-d6kr4, openshift-nmstate/nmstate-handler-x6mwm, openshift-sdn/sdn-67j55, openshift-storage/csi-rbdplugin-fwp2g, seccomp-profile-installer/seccomp-profile-installer-d4qhj

node/wrk-101 already cordoned
error: unable to drain node "wrk-101" due to error:[cannot delete Pods with local storage (use --delete-emptydir-data to override): curator-system/curatordb-cluster-00-lgb6-0, openshift-storage/object-backing-store-noobaa-pod-552cdf65, sail-24887a/comets-mongo-6c79b94448-fcg5h, cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-27n8k, nvidia-gpu-operator/gpu-feature-discovery-wktn8, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-p5b5w, nvidia-gpu-operator/nvidia-dcgm-exporter-dfhg4, nvidia-gpu-operator/nvidia-dcgm-tz7cm, nvidia-gpu-operator/nvidia-device-plugin-daemonset-mg45p, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-cgbwj, nvidia-gpu-operator/nvidia-mig-manager-8pcxq, nvidia-gpu-operator/nvidia-node-status-exporter-g26qq, nvidia-gpu-operator/nvidia-operator-validator-4bdzg, openshift-cluster-node-tuning-operator/tuned-qtk2z, openshift-dns/dns-default-4rpvn, openshift-dns/node-resolver-rrxvb, openshift-image-registry/node-ca-4n5jl, openshift-ingress-canary/ingress-canary-kqgxv, openshift-logging/collector-wzhxl, openshift-machine-config-operator/machine-config-daemon-clm97, openshift-monitoring/node-exporter-zn65c, openshift-multus/multus-additional-cni-plugins-x4wjf, openshift-multus/multus-npzsd, openshift-multus/network-metrics-daemon-mc46n, openshift-network-diagnostics/network-check-target-z2dhn, openshift-nfd/nfd-worker-smjbl, openshift-nmstate/nmstate-handler-pf6qk, openshift-sdn/sdn-fh6cf, openshift-storage/csi-rbdplugin-9lfjr, seccomp-profile-installer/seccomp-profile-installer-zn4wr], continuing command...
There are pending nodes to be drained:
 wrk-101
cannot delete Pods with local storage (use --delete-emptydir-data to override): curator-system/curatordb-cluster-00-lgb6-0, openshift-storage/object-backing-store-noobaa-pod-552cdf65, sail-24887a/comets-mongo-6c79b94448-fcg5h
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): ipmi-exporter/ipmi-exporter-27n8k, nvidia-gpu-operator/gpu-feature-discovery-wktn8, nvidia-gpu-operator/nvidia-container-toolkit-daemonset-p5b5w, nvidia-gpu-operator/nvidia-dcgm-exporter-dfhg4, nvidia-gpu-operator/nvidia-dcgm-tz7cm, nvidia-gpu-operator/nvidia-device-plugin-daemonset-mg45p, nvidia-gpu-operator/nvidia-driver-daemonset-415.92.202407191425-0-cgbwj, nvidia-gpu-operator/nvidia-mig-manager-8pcxq, nvidia-gpu-operator/nvidia-node-status-exporter-g26qq, nvidia-gpu-operator/nvidia-operator-validator-4bdzg, openshift-cluster-node-tuning-operator/tuned-qtk2z, openshift-dns/dns-default-4rpvn, openshift-dns/node-resolver-rrxvb, openshift-image-registry/node-ca-4n5jl, openshift-ingress-canary/ingress-canary-kqgxv, openshift-logging/collector-wzhxl, openshift-machine-config-operator/machine-config-daemon-clm97, openshift-monitoring/node-exporter-zn65c, openshift-multus/multus-additional-cni-plugins-x4wjf, openshift-multus/multus-npzsd, openshift-multus/network-metrics-daemon-mc46n, openshift-network-diagnostics/network-check-target-z2dhn, openshift-nfd/nfd-worker-smjbl, openshift-nmstate/nmstate-handler-pf6qk, openshift-sdn/sdn-fh6cf, openshift-storage/csi-rbdplugin-9lfjr, seccomp-profile-installer/seccomp-profile-installer-zn4wr
tssala23 commented 1 month ago

@DanNiESh Is the reaper able to kill class pods on cordoned nodes? To add context:

rhods-notebooks                                              jupyter-nb-ashikap-40bu-2eedu-0                                   2/2     Running                      0                   20d     10.131.20.214   wrk-94    <none>           <none>
rhods-notebooks                                              jupyter-nb-bmustafa-40bu-2eedu-0                                  2/2     Running                      0                   15d     10.131.20.222   wrk-94    <none>           <none>
rhods-notebooks                                              jupyter-nb-feli-40bu-2eedu-0                                      2/2     Running                      0                   26d     10.131.20.189   wrk-94    <none>           <none>
rhods-notebooks                                              jupyter-nb-jesswm-40bu-2eedu-0                                    2/2     Running                      2 (15d ago)         27d     10.131.20.166   wrk-94    <none>           <none>

Those pods/notebooks have all been running in the rhods-notebooks ns for a while
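(A per-node listing like the one above can be reproduced with something along these lines; the field selector is illustrative:)

$ oc get pods -n rhods-notebooks -o wide --field-selector spec.nodeName=wrk-94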

DanNiESh commented 1 month ago

(Quoting @tssala23's question and pod listing above.)

Looks like these notebooks belong to the ds210 class. The reaper won't shut down notebooks for the ds210 course, per the professor's requirements.

tssala23 commented 1 month ago

@joachimweyl @jtriley All of the notebook pods in the rhods-notebooks namespace on nodes wrk-94 and wrk-95 belong to the ds210 class, which the reaper is not set up to kill per the professor's requirements.

If we have to kill these pods, we need to notify the ds210 students and professor in advance.

joachimweyl commented 1 month ago

@jtriley please move the 2 that were cleared to ESI.

As for the other 2 discussed above: @msdisme, I assume we want to test the waters with the PI to see if these pods can be shut down? @Milstein, please be prepared to reach out to kthanasi@bu.edu to discuss these pods.

jtriley commented 1 month ago

Looking closer at the pods on these hosts, I found the following user workloads that might be of concern:

ai-telemetry-cbca60/keycloak-0
ope-rhods-testing-1fef2f/ja-ucsls-0
ope-rhods-testing-1fef2f/meera-utc-0
ope-rhods-testing-1fef2f/test-image-0
ope-rhods-testing-1fef2f/vnc-0
sail-24887a/comets-mongo-6c79b94448-fcg5h

I'm guessing the ope-rhods-testing-1fef2f pods could be killed, but I'm not sure about the pods in the ai-telemetry-cbca60 and sail-24887a namespaces.

The keycloak-0 pod will likely come back cleanly from the single-sign-on operator.

I've asked about these pods in Slack and am waiting for feedback on whether they are safe to terminate.

DanNiESh commented 1 month ago

Yes, ope-rhods-testing-1fef2f could be killed.
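(Clearing those pods would presumably be a plain delete; they should reschedule onto uncordoned nodes if their controllers recreate them. The pod names are the ones listed above; the command itself is an illustrative sketch:)

$ oc delete pod -n ope-rhods-testing-1fef2f ja-ucsls-0 meera-utc-0 test-image-0 vnc-0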

jtriley commented 1 month ago

I've removed wrk-96 and wrk-101 and notified @hakasapl via Slack that they are available to be moved over to ESI.
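(The removal steps aren't shown in the thread; presumably this amounted to draining each node and deleting its Node object before the hardware handoff, roughly:)

$ oc adm drain wrk-96 --ignore-daemonsets --delete-emptydir-data
$ oc delete node wrk-96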

jtriley commented 1 month ago

These are the user workloads still running on the other 2x A100s we're looking to remove from prod and move to ESI (wrk-94 and wrk-95):

rhods-notebooks/jupyter-nb-ashikap-40bu-2eedu-0
rhods-notebooks/jupyter-nb-bmustafa-40bu-2eedu-0
rhods-notebooks/jupyter-nb-feli-40bu-2eedu-0
rhods-notebooks/jupyter-nb-jesswm-40bu-2eedu-0
rhods-notebooks/jupyter-nb-karina18-40bu-2eedu-0
rhods-notebooks/jupyter-nb-quachr-40bu-2eedu-0
rhods-notebooks/jupyter-nb-tlarsen-40bu-2eedu-0
rhods-notebooks/jupyter-nb-wenyangl-40bu-2eedu-0
rhods-notebooks/jupyter-nb-zahran-40bu-2eedu-0
sail-24887a/redis-558f589fbb-c77ws
jtriley commented 1 month ago

I've removed wrk-94 and wrk-95 and notified @hakasapl via Slack that they are available to be moved over to ESI.