tssala23 closed 1 month ago
Unfortunately, there are no A100s free; MOC-R8PAC23U26 is unassigned, but it was having provisioning issues that Hakan opened a ticket for.
Is it possible to use MOC-R8PAC23U28 (assigned to research_rhelai) for this for now? It looks like MOC-R8PAC23U27 is also still in the "available" state, meaning it's unused - but it's assigned to nerc-test, and I don't know if they're planning on using it soon. @hpdempsey @joachimweyl any ideas regarding that latter node?
@tzumainn MOC-R8PAC23U28 (assigned to research_rhelai) will be free sometime next week; I am running a last experiment on it. I can't remember who was going to use the one in nerc-test; I would assume @dystewart? Maybe it was meant to be put in a test cluster to test the MIG/taint stuff, but there's a V100 there now anyway.
My understanding is that MOC-R8PAC23U27 is available, temp testing is done.
@hakasapl can you share the GH issue where you are tracking work on MOC-R8PAC23U26?
@joachimweyl Who would we ask to confirm if MOC-R8PAC23U27 is available?
@dystewart you are all done with MOC-R8PAC23U27 in the test cluster right?
The node has been replaced with MOC-R8PAC23U34
The A100 that is attached to ocp-beta-test (MOC-R8PAC23U25) is experiencing the issue it had before, where it constantly reboots. This started again after the node was rebooted to apply a new machine config. This is the only one of the three broken A100s that did not have its board replaced, because it had stopped rebooting. It seems that even if it does eventually stop doing this, the board should still be replaced. For the time being, it would be easiest to replace this A100 with one of the ones not being used.