nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Update udev rules for 5 V100s in OpenShift Prod #735

Open joachimweyl opened 1 month ago

joachimweyl commented 1 month ago

Motivation

5 of the 7 V100s are not working properly; they require additional udev rules to function. They have been cordoned until the proper rules are in place.
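
For context, later comments note that the udev rules here exist to give consistent `nic1`/`nic2` device names across the cluster. A minimal sketch of what such a rule might look like (the file path, rule file name, and MAC address below are placeholders, not the actual NERC config):

```
# /etc/udev/rules.d/70-custom-net-names.rules  (hypothetical path and filename)
# Match a specific NIC by its MAC address and pin it to a stable name.
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="aa:bb:cc:dd:ee:ff", NAME="nic1"
```

On OpenShift nodes a file like this would typically be delivered via a MachineConfig rather than edited on the host directly.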

Completion Criteria

udev rules updated for these 5 V100s

Description

Completion dates

Desired: 2024-10-02
Required: TBD

joachimweyl commented 1 month ago

https://github.com/OCP-on-NERC/nerc-ocp-config/pull/536

msdisme commented 1 month ago

The goal is to do this as soon as possible. Identify blackout dates for courses/presentations, who to notify, and whether notifications should go through mailing lists.

  1. Students
  2. TAs/TFs, including NEU Hema and the cloud computing course. Details for courses in #576.

msdisme commented 1 month ago

@jtriley, you mentioned several possible ways to make this not require a system outage in the future - if they're not already written down in another issue, could you please add them here, and if they are just a pointer?

msdisme commented 1 month ago

@joachimweyl , is this meant to be in the icebox?

jtriley commented 1 month ago

> @jtriley, you mentioned several possible ways to make this not require a system outage in the future - if they're not already written down in another issue, could you please add them here, and if they are just a pointer?

A couple of ways around this:

  1. Custom machine config pools (e.g. https://access.redhat.com/solutions/5688941). Each node could be added to custom pools as needed, and changes to those pools apply only to those hosts rather than to all worker nodes in the system. Custom worker pools inherit from the base worker pool, so they would still get the updates that are meant to apply cluster-wide. The downside is that over time we might end up juggling a bunch of these, which creates an elaborate setup compared to the stock two-pool setup that comes out of the box (i.e. one controller pool, one worker pool).

  2. Abandon the udev rule approach (used to get consistent nic1 and nic2 device names across the cluster) and use the devices as they're named by the kernel. The downside is that we'd need to manage custom NNCP configs for each host to handle differences in device names. The upside is that we wouldn't need to reboot all worker nodes in the cluster or manage custom machine config pools when new/different NIC devices show up in the cluster.
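
As a rough illustration of option 1, a custom pool follows the pattern from the Red Hat solution linked above; the pool name and node label here are hypothetical, not anything currently deployed:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-gpu            # hypothetical pool name
spec:
  # Pick up MachineConfigs for both the base worker role and this pool,
  # so cluster-wide worker updates still apply to these nodes.
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, worker-gpu]
  # Only nodes carrying this (hypothetical) label join the pool.
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-gpu: ""
```

A MachineConfig carrying the udev rules would then target the `worker-gpu` role and reboot only that pool's nodes when it changes.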

jtriley commented 1 month ago

Just noting the list of hosts from https://github.com/OCP-on-NERC/nerc-ocp-config/pull/536:

wrk-10[2,3,6,7,8]

joachimweyl commented 3 weeks ago
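
For option 2 from the earlier comment, a per-host NNCP for one of these hosts might look roughly like the sketch below; the policy name, hostname form, and interface name are assumptions for illustration, since the kernel-assigned device names differ per host:

```yaml
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: wrk-102-nic-config     # hypothetical per-host policy name
spec:
  # Scope the policy to a single node by hostname.
  nodeSelector:
    kubernetes.io/hostname: wrk-102
  desiredState:
    interfaces:
      - name: ens1f0           # kernel-assigned name; would differ per host
        type: ethernet
        state: up
```

The trade-off jtriley describes is visible here: one such policy per host instead of a single cluster-wide config, but no udev renaming and no reboot of all workers.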

@jtriley, with the manual steps you just did, are we able to close this issue, or would you rather leave it open for cleanup?

jtriley commented 3 weeks ago

@joachimweyl I suppose we could, but we still need to merge https://github.com/OCP-on-NERC/nerc-ocp-config/pull/536 during a maintenance window to be fully complete.

joachimweyl commented 3 weeks ago

Then I will extend it and push it out to next spring. Thank you.