Closed tssala23 closed 2 months ago
There are four A100s, but three of them are already leased:
- MOC-R8PAC23U27: ope
- MOC-R8PAC23U28: nerc-admins
- MOC-R8PAC23U30: nerc-admins
- MOC-R8PAC23U31
Is ope still using their A100?
On Tue, Aug 20, 2024 at 11:27 AM Tzu-Mainn Chen @.***> wrote:
There are four A100s, but three of them are already leased:
- MOC-R8PAC23U27: ope
- MOC-R8PAC23U28: nerc-admins
- MOC-R8PAC23U30: nerc-admins
- MOC-R8PAC23U31
Is ope still using their A100?
We have more A100s. What do we need to do to get them to show up in ESI?
—
Reply to this email directly, view it on GitHub https://github.com/nerc-project/operations/issues/690#issuecomment-2299137551, or unsubscribe https://github.com/notifications/unsubscribe-auth/AMGMIENQXLVIFLNBB4VNU7TZSNN7LAVCNFSM6AAAAABM2DS4IOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOJZGEZTONJVGE . You are receiving this because you were assigned.Message ID: @.***>
There are 64 altogether, although some are currently allocated to OpenStack, so we probably don't care about ESI having those in our inventory, but the rest should be.
@hakasapl are the additional A100s something that can be added to ESI?
@tzumainn for the time being the 1 free one can work to get started with
yep - added to OpenShiftBeta!
Thank you!
I think the A100 nodes not in ESI are in NERC OpenStack/OpenShift. Maybe we should bring it up in the weekly NERC meeting? Unless they have already removed those nodes from NERC and I can just switch them over? If that's the case, which nodes are they?
@hpdempsey what project are these 2 Nodes for and does it have funding?
These nodes are for the project to run InstructLab/RHELAI on OpenShift AI in the MOC. I am funding it.
We can't use production OpenShift AI for this project, which is why we can't use the GPUs that might already be allocated to production OpenShift. Unless those are all actively in use, we should re-allocate them to this project through ESI.
@joachimweyl are we good to move the two A100s assigned to nerc admins to this project?
The two A100s allocated to nerc-admins are in the 'available' state, meaning they are not being actively used right now. @joachimweyl is it possible to get confirmation that they are okay to move?
@tzumainn would you please add MOC-R8PAC23U28 to the OpenShiftBeta project? Once you're done please check off the InstructLab checkbox in this issue.
Done!
@hpdempsey I was helping @tssala23 diagnose a problem with unexpected reboots on the GPU node, and it looks like there may be a hardware problem. The event log for the node contains multiple repetitions of:
A software NMI has occurred on system ThinkSystem SD650-N V2. August 27, 2024 5:38:10 PM
Fault in slot All PCI Error on system ThinkSystem SD650-N V2. August 27, 2024 5:38:02 PM
An Uncorrectable PCIe Error has Occurred at Bus 00 Device 1C Function 00. The Vendor ID for the device is 8086 and the Device ID is A190. The physical slot number is 0. August 27, 2024 5:37:59 PM
@hakasapl thinks a virtual reseat on the chassis controller might clear up the problem, but the chassis controllers for these systems are not available at the moment. He has asked Tech Square to go in and physically reseat the blade, which will probably happen tomorrow.
@tssala23 is looking into the possibility of simply swapping in another node, rather than waiting for the reseat.
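For anyone counting how often these events repeat, here is a minimal Python sketch that splits event-log lines like the ones quoted above into a timestamp and a message. It assumes every entry ends with a timestamp in the "Month D, YYYY H:MM:SS AM/PM" form seen here and that the default English locale is in use; `parse_entries` and `count_messages` are hypothetical helper names, not part of any vendor or ESI tool.

```python
import re
from collections import Counter
from datetime import datetime

# Sample entries copied from the event log quoted in this thread.
RAW_LOG = """\
A software NMI has occurred on system ThinkSystem SD650-N V2. August 27, 2024 5:38:10 PM
Fault in slot All PCI Error on system ThinkSystem SD650-N V2. August 27, 2024 5:38:02 PM
An Uncorrectable PCIe Error has Occurred at Bus 00 Device 1C Function 00. The Vendor ID for the device is 8086 and the Device ID is A190. The physical slot number is 0. August 27, 2024 5:37:59 PM
"""

# Assumed layout: free-text message followed by a trailing timestamp.
ENTRY_RE = re.compile(
    r"^(?P<message>.*?)\s+"
    r"(?P<stamp>[A-Z][a-z]+ \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M)$"
)


def parse_entries(text):
    """Return (timestamp, message) tuples sorted oldest first."""
    entries = []
    for line in text.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        # %B relies on English month names (the default "C" locale).
        stamp = datetime.strptime(match["stamp"], "%B %d, %Y %I:%M:%S %p")
        entries.append((stamp, match["message"]))
    return sorted(entries)


def count_messages(entries):
    """Count how many times each distinct message appears."""
    return Counter(message for _, message in entries)


if __name__ == "__main__":
    for stamp, message in parse_entries(RAW_LOG):
        print(stamp.isoformat(), message)
```

Sorting oldest first makes the sequence easier to read: in this sample the uncorrectable PCIe error is logged first, then the slot fault, then the NMI.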
@joachimweyl given the issues would it be possible to assign them MOC-R8PAC23U30 instead?
approved.
Okay, done! Should I cancel the lease for MOC-R8PAC23U28 then, or leave it assigned to OpenShiftBeta?
cancel please
Done!
Can ESI show this as down or something so no one else tries to use it until we know it is working? Also, I assume @joachimweyl needs to remove this from billing somehow, since no one can use it while in this state.
@hakasapl the re-seat happened, correct? so it's possible this node works again?
if not, I'll put it in maintenance mode, which is what we do to show a node has issues
@tzumainn the re-seat did happen. So it is possible that it is working again.
Okay! Should this node be assigned somewhere then?
@hpdempsey unless you know of upcoming requests from RH for an A100 node, @tzumainn I would suggest nerc-admins so it is ready for a quick move to prod.
@joachimweyl @hpdempsey I'm not sure about the utilization of these in prod, but InstructLab on RHOAI will need another one at some point, as they want to demonstrate multi-node training. It also sounds like we should be able to test running RHEL AI, which would need one as well (though there may be one already on hold for RHEL AI purposes).
MOC-R8PAC23U31 is in the rhelai project. MOC-R8PAC23U30 is allocated for InstructLab. Since prod does not need it yet, we can leave it as a floater in the nerc-admins ESI project. As soon as prod does need it we can move it back, unless InstructLab or some other RH project needs it first, in which case we can change its project.
MOC-R8PAC23U30 & MOC-R8PAC23U25 have both been attached to the cluster.
This cluster is going to be used by Michael Clifford and Sanjay Arora to run some InstructLab experiments. They have said they will need at least 8 GPUs, so 2 A100 servers.
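The sizing above can be written out as a small calculation. This is a sketch only: the figure of 4 GPUs per A100 server is an assumption inferred from "8 GPUs, so 2 A100 servers", and `servers_needed` is a hypothetical helper name.

```python
# Assumption inferred from the thread: 8 GPUs map to 2 A100 servers,
# i.e. 4 GPUs per server.
GPUS_PER_A100_SERVER = 4


def servers_needed(gpus_required, gpus_per_server=GPUS_PER_A100_SERVER):
    """Return the smallest number of servers that provides gpus_required GPUs."""
    if gpus_required < 0 or gpus_per_server <= 0:
        raise ValueError("GPU counts must be non-negative and servers must hold at least one GPU")
    # Integer ceiling division: a partially used server still counts as a whole one.
    return -(-gpus_required // gpus_per_server)


if __name__ == "__main__":
    print(servers_needed(8))
```

Under the same assumption, any request that is not a multiple of 4 GPUs rounds up to the next whole server.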