joachimweyl opened 1 month ago
The goal is to do this as soon as possible. Identify blackout dates for courses/presentations. Identify who to notify and whether they are on mailing lists.
@jtriley, you mentioned several possible ways to make this not require a system outage in the future. If they're not already written down in another issue, could you please add them here, or if they are, a pointer?
@joachimweyl, is this meant to be in the icebox?
> @jtriley, you mentioned several possible ways to make this not require a system outage in the future. If they're not already written down in another issue, could you please add them here, or if they are, a pointer?
Couple of ways around this:
Custom machine config pools (e.g. https://access.redhat.com/solutions/5688941). Each node could be added to custom pools as needed, and changes to those pools apply only to those hosts rather than to all worker nodes in the system. Custom worker pools inherit the base worker pool, so they would still get the same updates that are meant to be applied cluster-wide. The downside is that over time we might be juggling a bunch of these, and it creates an extravagant setup compared to the stock two-pool setup that comes out of the box (i.e. one controller pool, one worker pool).
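A minimal sketch of what such a custom pool might look like, following the pattern in the linked Red Hat solution; the pool name `worker-nic` and its node-role label are hypothetical placeholders, not anything already in nerc-ocp-config:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-nic   # hypothetical custom pool name
spec:
  machineConfigSelector:
    matchExpressions:
      # Match both the base worker role and the custom role, so the pool
      # still inherits all cluster-wide worker MachineConfigs.
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, worker-nic]
  nodeSelector:
    matchLabels:
      # Only nodes carrying this label join the pool, so config changes
      # (and the resulting reboots) are limited to those hosts.
      node-role.kubernetes.io/worker-nic: ""
```

Nodes would opt in by being labeled with `node-role.kubernetes.io/worker-nic: ""`, and a MachineConfig carrying the udev rules would use the `worker-nic` role label so only that pool (and its reboots) is affected.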
Abandon the udev rule approach (i.e. the rules used to get consistent nic1 and nic2 device names across the cluster) and use the devices as they're named by the kernel. The downside is that we'd need to manage custom NNCP configs for each host to handle differences in device names. The upside is that we wouldn't need to reboot all worker nodes in the cluster or manage custom machine config pools when new/different NIC devices show up in the cluster.
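For that second option, each host would get its own policy pinned by hostname, using whatever device name the kernel assigned on that machine. A minimal kubernetes-nmstate sketch, assuming NNCP is how the NIC configuration is managed today; the hostname, interface name, and IP settings are placeholders:

```yaml
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: wrk-102-nics   # one policy per host
spec:
  nodeSelector:
    # Pin the policy to a single node instead of every worker.
    kubernetes.io/hostname: wrk-102
  desiredState:
    interfaces:
      # Kernel-assigned device name on this particular host; other hosts
      # may differ, which is why each one needs its own policy.
      - name: ens1f0
        type: ethernet
        state: up
        ipv4:
          enabled: true
          dhcp: true
```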
Just noting the list of hosts from https://github.com/OCP-on-NERC/nerc-ocp-config/pull/536:
wrk-10[2,3,6,7,8]
@jtriley, with the manual steps you just did, are we able to close this issue, or would you rather leave it open for cleanup?
@joachimweyl I suppose we could, but we still need to merge https://github.com/OCP-on-NERC/nerc-ocp-config/pull/536 during a maintenance window to be fully complete.
Then I will extend it and push it out to next spring. Thank you.
Motivation
5 of the 7 V100 nodes are not working properly; they require additional udev rules to function. They have been cordoned until they are given the proper rules.
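For context, udev rules on OpenShift nodes are typically delivered as a MachineConfig; a minimal sketch of that shape is below. The object name, file path, and rule contents are placeholders (the real rules are in the PR linked above), and because this targets the stock worker role it would roll out with reboots across the whole worker pool unless a custom pool like the one sketched earlier is used:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    # Targets the stock worker pool; swap in a custom role label
    # to confine the rollout (and reboots) to a custom pool instead.
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-nic-udev-rules   # hypothetical name
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/udev/rules.d/80-custom-nic-names.rules
          mode: 0644
          contents:
            # URL-encoded rule body goes here; placeholder only.
            source: data:,PLACEHOLDER
```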
Completion Criteria
udev rules updated for these 5 V100 nodes
Description
Completion dates
Desired: 2024-10-02 / Required: TBD